Humanist Discussion Group, Vol. 35, No. 688.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2022-05-05 16:49:15+00:00
From: Gabor Toth <gabor.toth@maximilianeum.de>
Subject: Working with not normalized historical texts

Dear Colleagues,

We work with a corpus of transcribed Italian handwritten news-sheets from the early modern period. We would like to apply tools from corpus linguistics and text analytics to study the transcriptions (about 1 million words). However, the transcriptions were not normalized, and they contain a great deal of orthographic variation.

I am writing to ask about current best practices for working with non-normalized historical corpora. In particular, I am interested in how other projects tackle normalization when the transcription phase is already complete.

For instance, when working with smaller datasets, I extracted all types and then manually built a dictionary mapping variants to normalized word forms. But given that we now have approximately 50,000 unique types, this does not seem feasible.

Best wishes,
Gabor