Humanist Discussion Group, Vol. 36, No. 31.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2022-05-25 08:20:12+00:00
From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
Subject: Re: [Humanist] 36.19: working with unnormalised historical texts

Dear Gabor et al,

We looked into approaches to this question in a general way in a workshop at
the BBAW in Berlin in 2019, and I wrote a short blog post [1] reporting on it.
Unfortunately the planned hackathon to follow up on the event was a victim of
the pandemic, but could be revived if there is sufficient interest.

The main outcome of the workshop was a guide to tools for normalisation
applications, in the form of a 'CLARIN Resource Family' [2]. This and other
'Resource Families' are subject to ongoing curation, and suggestions for
updates and corrections are very welcome.

Best wishes,
Martin

[1] https://sprache.hypotheses.org/1790
[2] https://www.clarin.eu/resource-families/tools-normalisation

On 18/05/2022 07:49, Humanist wrote:
> Humanist Discussion Group, Vol. 36, No. 19.
>
> Date: 2022-05-17 13:02:30+00:00
> From: Crystal Hall <chall@bowdoin.edu>
> Subject: Re: [Humanist] 36.17: working with unnormalised historical texts
>
> Dear Gabor,
>
> Thank you for asking about these points. Since I have just been using this
> personally, clarifications will hopefully make it more useful eventually as an
> open data set. I always wonder if my years of studying Galileo have made me too
> open to unorthodox approaches, so conversation is a helpful guide.
>
> I hesitated to call the result types because they are really just potential word
> forms. To me, a type is a pattern of characters that is actually found in a
> text.
> Most of the potential word forms were generated algorithmically using the
> rules of Italian grammar. For example, I wrote code to conjugate the verbs in
> the original data in 7 simple tenses and moods plus creating the gerund,
> participle, and present progressive. The hand processing in your first question
> refers to excepting known irregular verbs from the algorithmic process for
> regular verbs. I created variations of the overall process to account for
> orthography (e.g. gn for ng) and certain families of verbs (e.g. fare, dire,
> -rre, etc.). I also automatically created abbreviated forms (e.g. that drop the
> terminal -o) and added a series of pronoun combinations to the infinitives,
> gerunds, and participles following conventions at the time. Galileo was
> notorious for this practice. Each of the 5,000+ root verbs exists in my data in
> over 200 different forms that could be types, but some would never exist in a
> text because they defy usage rules. There is currently some bloat in how the
> participle variations are created, which I would tidy if I were to publish the
> data. Your email inquiry arrived just as I was starting to work with the draft
> results. As a literary scholar trying to understand patterns of expression, my
> priority was to tag/annotate the types in my texts (creating models of specific
> instances of language) rather than create a comprehensive linguistic model of
> the period (which feels more like the realm of linguists).
>
> Further examination of the types in the documents that are not covered by this
> data will improve the workflow and tagging rate over time.
>
> Best wishes,
> Crystal
>
> From: Humanist <humanist@dhhumanist.org>
> Date: Tuesday, May 17, 2022 at 1:18 AM
> To: Crystal Hall <chall@bowdoin.edu>
> Subject: [Humanist] 36.17: working with unnormalised historical texts
>
> Humanist Discussion Group, Vol. 36, No. 17.
>
> Date: 2022-05-16 13:46:31+00:00
> From: Gabor Toth <gabor.toth@maximilianeum.de>
> Subject: Re: [Humanist] 36.12: working with unnormalised historical texts
>
> Dear Crystal,
>
> Many thanks for your detailed answer; congratulations for working out all
> this, which sounds great.
>
> Could you please explain what you mean by the following two points:
>
> 1. "I hand coded the entries for part of speech using a
> simplified Penn Tree Bank system, marked known irregulars for hand
> processing, and built out the other possible variations algorithmically."
>
> I understand the hand coding part but I am not sure about the following
> steps.
>
> 2. "The result was a first draft of over 3 million forms"
>
> That sounds like a very big number, by forms do you mean types?
>
> Cheers,
>
> Gabor

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
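
The form-generation workflow Crystal Hall describes in the quoted message
(conjugating regular verbs by rule, excepting known irregulars, adding gn/ng
spelling variants, truncated forms, and pronoun combinations) might look
roughly like the Python sketch below. It is not her code. The function names,
the short irregular and pronoun lists, and the truncation rule are assumptions
made for illustration, and only the present indicative is covered rather than
the seven tenses and moods she mentions.

```python
from typing import Dict, List, Optional, Set

# Regular present-indicative endings, keyed by infinitive ending.
# Spelling adjustments (e.g. mangiare) and -isc- verbs (e.g. finire)
# are deliberately not handled in this sketch.
PRESENT_ENDINGS: Dict[str, List[str]] = {
    "are": ["o", "i", "a", "iamo", "ate", "ano"],
    "ere": ["o", "i", "e", "iamo", "ete", "ono"],
    "ire": ["o", "i", "e", "iamo", "ite", "ono"],
}
GERUND_ENDING: Dict[str, str] = {"are": "ando", "ere": "endo", "ire": "endo"}
PARTICIPLE_ENDING: Dict[str, str] = {"are": "ato", "ere": "uto", "ire": "ito"}

# Illustrative lists only (assumptions, not the original data).
ENCLITICS: List[str] = ["mi", "ti", "si", "lo", "la", "li", "le", "ci", "vi", "ne"]
KNOWN_IRREGULARS: Set[str] = {"fare", "dire", "essere", "avere", "andare"}
VOWELS = "aeiou"


def truncate_final_o(form: str) -> Optional[str]:
    """Drop a terminal -o after vowel + l/m/n/r (assumed rule), else None."""
    if len(form) >= 4 and form[-1] == "o" and form[-2] in "lmnr" and form[-3] in VOWELS:
        return form[:-1]
    return None


def spelling_variants(form: str) -> Set[str]:
    """Return gn/ng orthographic variants of a form (may over-generate)."""
    variants: Set[str] = set()
    if "ng" in form:
        variants.add(form.replace("ng", "gn"))
    if "gn" in form:
        variants.add(form.replace("gn", "ng"))
    return variants


def potential_forms(infinitive: str) -> Set[str]:
    """Generate potential word forms for one regular verb."""
    if infinitive in KNOWN_IRREGULARS:
        # Irregulars are excepted from the rule-based process.
        raise ValueError(f"{infinitive} is marked for hand processing")
    stem, ending = infinitive[:-3], infinitive[-3:]
    if ending not in PRESENT_ENDINGS:
        # Other verb families (e.g. -rre) would need their own rules.
        raise ValueError(f"no regular pattern for {infinitive}")

    gerund = stem + GERUND_ENDING[ending]
    participle = stem + PARTICIPLE_ENDING[ending]
    forms: Set[str] = {infinitive, gerund, participle}
    forms.update(stem + e for e in PRESENT_ENDINGS[ending])

    # Truncated forms, built from the plain (non-enclitic) forms only.
    forms |= {truncate_final_o(f) for f in forms} - {None}

    # Enclitic pronouns attach to infinitive (minus final -e), gerund, participle.
    hosts = [infinitive[:-1], gerund, participle]
    forms.update(host + p for host in hosts for p in ENCLITICS)

    # Orthographic variants of everything generated so far.
    for form in list(forms):
        forms |= spelling_variants(form)
    return forms
```

Under these assumptions, potential_forms("parlare") contains forms such as
"parliam", "parlarlo" and "parlandolo", while potential_forms("fare") raises a
ValueError so that the verb can be handled by hand. Like the process described
above, the sketch over-generates: some of its output would never occur in a
text.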