Humanist Discussion Group, Vol. 36, No. 11.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

        Date: 2022-05-14 14:50:35+00:00
        From: Crystal Hall <chall@bowdoin.edu>
        Subject: Re: [Humanist] 36.2: working with unnormalised historical texts


Dear Gabor,

Please pardon the slow response during the last week of our academic term. I have taken a hybrid manual-automated approach to solving this problem for my work on Galileo's library (currently ~340,000 word types), and I am curious to know what you and others might think of this method.

I have made a model for annotating words in early modern texts (rather than a model of the early modern Italian language), because so many of the texts that I study actively resist regular syntax as part of the way they make meaning, and follow at best loose conventions of punctuation, capitalization, and orthography, if they follow any at all. As you pointed out, the usual features that would help to infer parts of speech with any natural language processing are inconsistent, to say the least.

I began with the entries in the 1612 Vocabolario della Crusca (for the non-Italianists still reading, it billed itself as the first authoritative Italian dictionary). I hand-coded the entries for part of speech using a simplified Penn Treebank tagset, marked known irregulars for hand processing, and built out the other possible variations algorithmically (a toy sketch of this expansion-and-lookup step follows below). Importantly, the types that can have different parts of speech receive an aggregated tag so that, for example, there is no inference about what role “ci” plays in a sentence: it always carries its multiple possible parts of speech in the tag. Each generated word type is paired with its original entry to assist with lemmatization. The process creates forms that were never used, in the hope of creating most of the forms that were.

The result was a first draft of over 3 million forms (more than estimates of the number of spoken and written terms in modern Italian), which tagged up to 77% of the word types in the 82 book-length documents in my growing corpus. Given that 20-90% of any one of those texts is made up of types that occur nowhere else in the corpus, this felt like a good start, and certainly better than the results I had with the NLP models of Italian that I could find.

It has been most interesting to see where this approach is completely insufficient: lyric poetry, translations, and ephemera such as orations, mostly by authors who tried unsuccessfully to find princely patrons. The combination of this model and my particular corpus captures some of the extent of linguistic gatekeeping in the period. As expected, the model works best on the academic, court-based literature (the milieu of the dictionary's creators and their sources). My plan is to expand it incrementally by adding more types from the documents, starting with the documents that have the lowest percentage of matches. Unlike other processes, growing this model will not change existing tags unless I find an error.

It seems that our needs might be complementary and that we might be able to assist one another in iteratively building something more comprehensive? Or perhaps you have already learned of a better method? Please feel free to be in touch directly via email.
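To make the workflow above a little more concrete, here is a minimal sketch in Python of the expansion-and-lookup step. The seed entries, tag abbreviations, and variant rules are toy placeholders rather than my actual data or code, but the shape is the same: hand-coded lemma/tag pairs are expanded into variant forms, ambiguous forms keep an aggregated tag rather than being disambiguated, and every generated form points back to its source entry for lemmatization.

    # Toy sketch only: the entries, tags, and variant rules are invented for
    # illustration; they are not the actual Crusca data, tagset, or expansion rules.
    from collections import defaultdict

    # Hand-coded seed entries: lemma -> simplified part-of-speech tag.
    # Ambiguous lemmas simply appear once per possible tag.
    SEED = [
        ("amore", "NN"),
        ("cantare", "VB"),
        ("ci", "PRP"),   # clitic pronoun ...
        ("ci", "RB"),    # ... or locative adverb; both tags are kept
    ]

    def variants(lemma):
        """Generate plausible spelling/inflection variants of a lemma.
        These rules are placeholders for the real, dictionary-driven expansion."""
        forms = {lemma}
        forms |= {f.replace("u", "v") for f in forms}  # u/v were interchangeable
        forms |= {f.replace("i", "j") for f in forms}  # likewise i/j
        if lemma.endswith("e"):                        # toy inflectional endings
            forms |= {lemma[:-1] + suffix for suffix in ("i", "a", "o")}
        return forms

    # Every generated form keeps an aggregated tag (all possible parts of
    # speech) and a pointer back to its source entry for lemmatization.
    lexicon = defaultdict(lambda: {"tags": set(), "lemmas": set()})
    for lemma, tag in SEED:
        for form in variants(lemma):
            lexicon[form]["tags"].add(tag)
            lexicon[form]["lemmas"].add(lemma)

    def tag_types(tokens):
        """Tag the unique word types in a token list. No disambiguation is
        attempted: ambiguous forms keep their aggregated tag (e.g. PRP|RB).
        Returns the tagged types and the share of types that matched."""
        types = {t.lower() for t in tokens}
        tagged = {t: "|".join(sorted(lexicon[t]["tags"]))
                  for t in types if t in lexicon}
        return tagged, (len(tagged) / len(types) if types else 0.0)

    # A made-up token list standing in for one document of the corpus.
    tags, coverage = tag_types("ci cantare amori amore dunque".split())
    print(tags)                                     # {'ci': 'PRP|RB', ...}
    print(f"{coverage:.0%} of word types matched")  # 80% in this toy example

The coverage figure printed at the end is the same percentage-of-types-matched measure I use to decide which documents to mine next for new entries.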
Best wishes,
Crystal

Crystal Hall (she/her)
Director, Digital and Computational Studies <https://www.bowdoin.edu/digital-and-computational-studies/>
Affiliated faculty, Italian Studies
Bowdoin College
9300 College Station
Brunswick, ME 04011
310 Visual Arts Center
profhall.net <http://profhall.net/>
Virtual meeting scheduling: https://calendly.com/prof-chall

----------------------------------------------------------

> Date: 2022-05-05 16:49:15+00:00
> From: Gabor Toth <gabor.toth@maximilianeum.de>
> Subject: Working with not normalized historical texts
>
> Dear Colleagues,
>
> We work with a corpus of (transcribed) Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~1 million
> words).
>
> However, the transcriptions were not normalized and they feature a great
> deal of orthographic variation.
>
> I am writing to ask about current best practices for working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization when the transcription phase is already
> complete.
>
> For instance, when working with smaller datasets I extracted all types and
> then built, manually, a dictionary of variants and normalized word forms.
>
> But given that we now have approximately 50,000 unique types, this does
> not seem feasible.
>
> Best wishes,
>
> Gabor
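As a point of reference, the manual step Gabor describes, extracting all types and mapping them through a hand-built dictionary of variants and normalized forms, might look something like the minimal sketch below. The sample sentence, the regular expression, and the handful of variant entries are invented for illustration; at roughly 50,000 unique types the real cost lies in building and maintaining the mapping, not in applying it.

    # Sketch of the manual workflow described above: extract the unique types,
    # then map them through a hand-built dictionary of variant -> normalized
    # form. The sample text and variant entries here are invented.
    import re
    from collections import Counter

    def word_types(text):
        """Lowercase the text and count its word types."""
        return Counter(re.findall(r"[^\W\d_]+", text.lower()))

    # A hand-built variant dictionary; in practice this would be loaded from
    # a file maintained alongside the transcriptions.
    NORMALIZATION = {
        "hauere": "avere",
        "auere": "avere",
        "et": "e",
    }

    sample = "Et si dice di hauere inteso che ..."
    counts = word_types(sample)
    normalized = {t: NORMALIZATION.get(t, t) for t in counts}
    still_open = sorted(t for t in counts if t not in NORMALIZATION)

    print(normalized)                       # variants mapped where a decision exists
    print("types still to decide:", still_open)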