Humanist Archives: May 18, 2022, 7:49 a.m. Humanist 36.19 - working with unnormalised historical texts

              Humanist Discussion Group, Vol. 36, No. 19.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne

        Date: 2022-05-17 13:02:30+00:00
        From: Crystal Hall
        Subject: Re: [Humanist] 36.17: working with unnormalised historical texts

Dear Gabor,

Thank you for asking about these points. Since I have so far used this data
only for my own work, clarifying them should eventually make it more useful as
an open data set. I always wonder if my years of studying Galileo have made me
too open to unorthodox approaches, so conversation is a helpful guide.

I hesitated to call the results "types" because they are really just potential
word forms. To me, a type is a pattern of characters that is actually found in a
text. Most of the potential word forms were generated algorithmically using the
rules of Italian grammar. For example, I wrote code to conjugate the verbs in
the original data in seven simple tenses and moods and to create the gerund,
participle, and present progressive. The "hand processing" in your first
question refers to exempting known irregular verbs from the algorithmic process
used for regular verbs. I created variations of the overall process to account
for orthography (e.g. gn for ng) and for certain families of verbs (e.g. fare,
dire, and verbs in -rre). I also automatically created abbreviated forms (e.g.
those that drop the terminal -o) and added a series of pronoun combinations to
the infinitives, gerunds, and participles, following the conventions of the
time; Galileo was notorious for this practice. Each of the 5,000+ root verbs
exists in my data in over 200 different forms that could be types, but some
would never occur in a text because they defy usage rules. There is currently
some bloat in how the participle variations are created, which I would tidy if
I were to publish the data. Your email inquiry arrived just as I was starting to
work with the draft results. As a literary scholar trying to understand patterns
of expression, my priority was to tag and annotate the types in my texts
(creating models of specific instances of language) rather than to create a
comprehensive linguistic model of the period (which feels more like the realm of
linguists).
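
As a rough illustration of the kind of rule-based generation described above
(not the code used for this data set), a minimal Python sketch for regular -are
verbs might look like the following. The ending lists, the clitic list, the
irregular-verb set, and the function name potential_forms are all assumptions
made for the example; it covers only the present indicative, gerund, participle,
two truncated forms, and single enclitic pronouns.

```python
# Illustrative rules for regular -are verbs only; every list here is an
# assumption for the sketch, not the rule set behind the actual data.
PRESENT_ARE = ["o", "i", "a", "iamo", "ate", "ano"]
PARTICIPLE_AGREEMENT = ["o", "a", "i", "e"]
ENCLITICS = ["mi", "ti", "lo", "la", "ci", "vi", "si", "ne"]
IRREGULAR = {"fare", "andare", "dare", "stare"}  # hypothetical hand-kept list


def potential_forms(infinitive):
    """Return a set of potential word forms for a regular -are verb."""
    if infinitive in IRREGULAR or not infinitive.endswith("are"):
        # Known irregulars and other verb families are left for hand
        # processing or for a separate rule set.
        return set()

    stem = infinitive[:-3]
    gerund = stem + "ando"
    forms = {infinitive, gerund}

    # Present indicative, one form per person and number.
    forms.update(stem + ending for ending in PRESENT_ARE)
    # Past participle with gender and number agreement.
    forms.update(stem + "at" + vowel for vowel in PARTICIPLE_AGREEMENT)
    # Truncated plural forms that drop the terminal -o.
    forms.update({stem + "iam", stem + "an"})
    # Enclitic pronouns: the infinitive drops its final -e first,
    # while the gerund takes the pronoun directly.
    forms.update(infinitive[:-1] + clitic for clitic in ENCLITICS)
    forms.update(gerund + clitic for clitic in ENCLITICS)
    return forms


forms = potential_forms("parlare")
```

Generating every combination mechanically is what produces forms that could be
types but may never occur in a text, which is why the output is best treated as
a lookup table for tagging rather than as a description of usage.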

Further examination of the types in the documents that are not covered by this
data will improve the workflow and tagging rate over time.
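
One simple way to find those uncovered types, sketched here under stated
assumptions, is to compare the set of types in a document with the set of
generated forms. The function name coverage, the lowercasing, and the
letters-only tokenisation are choices made for the example, not a description
of the actual workflow.

```python
import re


def coverage(text, known_forms):
    """Return the share of types found in known_forms and the rest, sorted."""
    # A type is a distinct pattern of letters actually found in the text.
    types = set(re.findall(r"[^\W\d_]+", text.lower()))
    covered = types & set(known_forms)
    uncovered = sorted(types - covered)
    rate = len(covered) / len(types) if types else 0.0
    return rate, uncovered


rate, uncovered = coverage(
    "Parlando con voi, parlo di cose nuove.",
    {"parlando", "parlo", "di"},
)
```

The sorted list of uncovered types is the material to examine by hand when
deciding which rules or exceptions to add next.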

Best wishes,

From: Humanist
Date: Tuesday, May 17, 2022 at 1:18 AM
To: Crystal Hall
Subject: [Humanist] 36.17: working with unnormalised historical texts

              Humanist Discussion Group, Vol. 36, No. 17.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne

        Date: 2022-05-16 13:46:31+00:00
        From: Gabor Toth
        Subject: Re: [Humanist] 36.12: working with unnormalised historical

Dear Crystal,

Many thanks for your detailed answer, and congratulations on working all
this out; it sounds great.

Could you please explain what you mean by the following two points:

1. "I hand coded the entries for part of speech using a
simplified Penn Tree Bank system, marked known irregulars for hand
processing, and built out the other possible variations algorithmically."

I understand the hand coding part but I am not sure about the following

2. "The result was a first draft of over 3 million forms"

That sounds like a very big number. By forms, do you mean types?



