Humanist Discussion Group, Vol. 36, No. 2.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2022-05-08 19:55:51+00:00
From: Nick Thieberger <thien@unimelb.edu.au>
Subject: Re: [Humanist] 35.688: working with unnormalised historical texts?

Hi Gabor,

We have a similar issue in dealing with manuscripts of Australian Indigenous
languages collected in the past and written in various systems that are often
internally inconsistent. We use a soundex search mechanism to look for known
variations, based on a sample set of sound correspondences (e.g. retroflex n
written as rn, *n*, nr; velar nasal written as ng, gn, kn; etc.).

You can either generate the forms in the documents themselves as alternate
forms (stored in TEI tags) or create a search that looks for any of the
possible forms in the original documents. See the fuzzy search option here:
http://bates.org.au/search/

As for the corpus itself, could you use the fuzzy search to take a word from
the source text and do a dictionary lookup in standard Italian, which would
then provide word-for-word correspondences between the non-standard line of
text and a new line populated by that lookup?

All the best,

Nick

***********************
Assoc.Prof.
Nick Thieberger FAHA
School of Languages and Linguistics
The University of Melbourne
Parkville, VIC 3010, Australia
+61 3 8344 8952
<http://languages-linguistics.unimelb.edu.au/thieberger>
nthieberger.net <http://nthieberger.net/>

Director, Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) <http://paradisec.org.au/>
Deputy Director Research Unit for Indigenous Language <https://arts.unimelb.edu.au/research-unit-for-indigenous-language>
CI in the ARC Centre of Excellence for the Dynamics of Language <http://www.dynamicsoflanguage.edu.au/>
Lead CI in Nyingarn, a platform for primary sources in Australian languages <Https://nyingarn.net>

On Fri, 6 May 2022 at 14:33, Humanist <humanist@dhhumanist.org> wrote:

> Humanist Discussion Group, Vol. 35, No. 688.
> Department of Digital Humanities, University of Cologne
> Hosted by DH-Cologne
> http://www.dhhumanist.org
> Submit to: humanist@dhhumanist.org
>
> Date: 2022-05-05 16:49:15+00:00
> From: Gabor Toth <gabor.toth@maximilianeum.de>
> Subject: Working with not normalized historical texts
>
> Dear Colleagues,
>
> We work with a corpus of - transcribed - Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~ 1 million
> words).
>
> However, the transcriptions were not normalized and they feature a lot of
> orthographical variations.
>
> I am writing to ask about current best practices of working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization if the transcription phase is already
> complete.
>
> For instance, when working with smaller datasets I extracted all types and
> then built - manually - a dictionary with variants and normalized word
> forms.
>
> But given that now we have approximately 50.000 - unique - types, this does
> not seem to be feasible.
>
> Best wishes,
>
> Gabor

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
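
The two ideas in Nick's reply (searching for any known spelling of a sound, and using the same matching to line up non-standard words with a standard dictionary) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code behind the bates.org.au search and not the classic Soundex algorithm: the correspondence table holds just the digraph spellings quoted in the message, the canonical symbols and the sample words are invented for the example, and every function name (sound_key, variant_regex, normalise_line) is hypothetical. For the Italian news-sheets the table would need to be replaced with correspondences drawn from that corpus.

```python
import re

# Assumed sample of sound correspondences: canonical symbol -> attested spellings.
# The spellings are the digraphs quoted in the message; the symbols are illustrative.
CORRESPONDENCES = {
    "ɳ": ["rn", "nr"],        # retroflex n
    "ŋ": ["ng", "gn", "kn"],  # velar nasal
}

# Reverse lookup: spelling -> canonical symbol.
_SPELLING_TO_SYMBOL = {
    spelling: symbol
    for symbol, spellings in CORRESPONDENCES.items()
    for spelling in spellings
}

# One regex that finds any attested spelling, longest alternatives first.
_TOKEN = re.compile(
    "|".join(sorted(map(re.escape, _SPELLING_TO_SYMBOL), key=len, reverse=True))
)


def sound_key(word):
    """Collapse every attested spelling to its canonical symbol."""
    return _TOKEN.sub(lambda m: _SPELLING_TO_SYMBOL[m.group(0)], word.lower())


def variant_regex(term):
    """Build a regex matching the term under any known spelling of its sounds."""
    lowered = term.lower()
    parts, pos = [], 0
    for m in _TOKEN.finditer(lowered):
        parts.append(re.escape(lowered[pos:m.start()]))
        alternatives = CORRESPONDENCES[_SPELLING_TO_SYMBOL[m.group(0)]]
        parts.append("(?:" + "|".join(map(re.escape, alternatives)) + ")")
        pos = m.end()
    parts.append(re.escape(lowered[pos:]))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def normalise_line(line, lexicon):
    """Replace each token by the lexicon entry sharing its sound key, if any."""
    index = {sound_key(entry): entry for entry in lexicon}
    return [index.get(sound_key(token), token) for token in line.split()]


if __name__ == "__main__":
    # Invented tokens, used only to exercise the functions.
    print(variant_regex("wangka").search("the form waknka appears").group(0))
    print(normalise_line("wagnka marnu", ["wangka", "manru"]))
```

Matching on a collapsed key keeps the original transcription untouched and produces the normalised form as a parallel line, which is the word-for-word correspondence suggested above. Tokens with no lexicon match are passed through unchanged so they can be reviewed by hand.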