Humanist Discussion Group, Vol. 36, No. 19.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2022-05-17 13:02:30+00:00
From: Crystal Hall <chall@bowdoin.edu>
Subject: Re: [Humanist] 36.17: working with unnormalised historical texts

Dear Gabor,

Thank you for asking about these points. Since I have just been using this personally, clarifications will hopefully make it more useful eventually as an open data set. I always wonder if my years of studying Galileo have made me too open to unorthodox approaches, so conversation is a helpful guide.

I hesitated to call the result types because they are really just potential word forms. To me, a type is a pattern of characters that is actually found in a text. Most of the potential word forms were generated algorithmically using the rules of Italian grammar. For example, I wrote code to conjugate the verbs in the original data in 7 simple tenses and moods plus creating the gerund, participle, and present progressive. The hand processing in your first question refers to excepting known irregular verbs from the algorithmic process for regular verbs.

I created variations of the overall process to account for orthography (e.g. gn for ng) and certain families of verbs (e.g. fare, dire, -rre, etc.). I also automatically created abbreviated forms (e.g. that drop the terminal -o) and added a series of pronoun combinations to the infinitives, gerunds, and participles following conventions at the time. Galileo was notorious for this practice.

Each of the 5,000+ root verbs exists in my data in over 200 different forms that could be types, but some would never exist in a text because they defy usage rules. There is currently some bloat in how the participle variations are created, which I would tidy if I were to publish the data.

Your email inquiry arrived just as I was starting to work with the draft results.
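
As an illustration of the kind of rule-based generation described above (not the code actually used for this data set), a minimal Python sketch might look like the following. The ending tables, the short pronoun list, the `IRREGULAR` set, and the function name `candidate_forms` are all assumptions made for the example. It covers only the present indicative, gerund, and past participle of regular verbs, and it ignores spelling adjustments such as those needed for stems ending in -i, -c, or -g.

```python
# Hypothetical sketch: generate potential word forms for regular Italian verbs.
# Only the present indicative, gerund, and past participle are covered here.
PRESENT_ENDINGS = {
    "are": ["o", "i", "a", "iamo", "ate", "ano"],
    "ere": ["o", "i", "e", "iamo", "ete", "ono"],
    "ire": ["o", "i", "e", "iamo", "ite", "ono"],
}
GERUND = {"are": "ando", "ere": "endo", "ire": "endo"}
PARTICIPLE = {"are": "ato", "ere": "uto", "ire": "ito"}

# Assumed sample of enclitic pronouns; a real list would be much longer.
ENCLITICS = ["mi", "ti", "lo", "la", "si"]

# Assumed sample of irregular verbs that are excepted from the regular rules.
IRREGULAR = {"fare", "dire", "essere", "avere"}


def candidate_forms(infinitive):
    """Return a set of potential word forms for a regular verb.

    Irregular verbs return an empty set, standing in for hand processing.
    """
    if infinitive in IRREGULAR:
        return set()

    stem, ending = infinitive[:-3], infinitive[-3:]
    if ending not in PRESENT_ENDINGS:
        # Families such as -rre would need their own variation of the process.
        raise ValueError(f"no regular rule for {infinitive!r}")

    forms = {infinitive}
    forms.update(stem + e for e in PRESENT_ENDINGS[ending])

    gerund = stem + GERUND[ending]
    forms.update({gerund, stem + PARTICIPLE[ending]})

    # Abbreviated variants: drop a terminal -o after m or n (an assumed rule).
    forms.update(f[:-1] for f in list(forms) if f.endswith(("mo", "no")))

    # Enclitic pronouns attach to the infinitive (minus final -e) and the gerund.
    for base in (infinitive[:-1], gerund):
        forms.update(base + p for p in ENCLITICS)

    return forms


if __name__ == "__main__":
    for form in sorted(candidate_forms("parlare")):
        print(form)
```

Even this reduced rule set yields a couple of dozen candidates for a single verb, which suggests how 7 tenses and moods, orthographic variants, and a fuller set of pronoun combinations could reach 200 or more forms per root. Some of the output would never occur in a text, which is why these are better described as potential word forms than as types.
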
As a literary scholar trying to understand patterns of expression, my priority was to tag/annotate the types in my texts (creating models of specific instances of language) rather than create a comprehensive linguistic model of the period (which feels more like the realm of linguists). Further examination of the types in the documents that are not covered by this data will improve the workflow and tagging rate over time.

Best wishes,
Crystal

From: Humanist <humanist@dhhumanist.org>
Date: Tuesday, May 17, 2022 at 1:18 AM
To: Crystal Hall <chall@bowdoin.edu>
Subject: [Humanist] 36.17: working with unnormalised historical texts

Humanist Discussion Group, Vol. 36, No. 17.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne

Date: 2022-05-16 13:46:31+00:00
From: Gabor Toth <gabor.toth@maximilianeum.de>
Subject: Re: [Humanist] 36.12: working with unnormalised historical texts

Dear Crystal,

Many thanks for your detailed answer; congratulations on working all this out, which sounds great. Could you please explain what you mean by the following two points:

1. "I hand coded the entries for part of speech using a simplified Penn Tree Bank system, marked known irregulars for hand processing, and built out the other possible variations algorithmically." I understand the hand coding part but I am not sure about the following steps.

2. "The result was a first draft of over 3 million forms" That sounds like a very big number; by forms, do you mean types?

Cheers,
Gabor

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php