Humanist Discussion Group

				
              Humanist Discussion Group, Vol. 35, No. 688.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2022-05-05 16:49:15+00:00
        From: Gabor Toth <gabor.toth@maximilianeum.de>
        Subject: Working with not normalized historical texts

Dear Colleagues,

We work with a corpus of transcribed Italian handwritten news-sheets
from the early modern period (~1 million words). We would like to apply
tools from corpus linguistics and text analytics to study the
transcriptions.

However, the transcriptions were not normalized and feature a great deal
of orthographic variation.

I am writing to ask about current best practices for working with
non-normalized historical corpora. I am interested in how other projects
tackle the problem of normalization when the transcription phase is
already complete.

For instance, when working with smaller datasets, I extracted all types
and then manually built a dictionary mapping variants to normalized word
forms.

But given that we now have approximately 50,000 unique types, this no
longer seems feasible.
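
To make the dictionary-based approach described above concrete, here is a
minimal Python sketch. It assumes a simple regex tokenizer and a hand-built
variant dictionary; the names VARIANTS, extract_types and normalize, the
sample sentence, and the three dictionary entries are illustrative
assumptions, not taken from our corpus or from any existing tool.

```python
import re
from collections import Counter

# Hypothetical variant dictionary: variant spelling -> normalized form.
# These entries are illustrative only, not drawn from the actual corpus.
VARIANTS = {
    "et": "e",
    "havere": "avere",
    "huomo": "uomo",
}

# Naive tokenizer (an assumption): a token is any run of word characters.
TOKEN_RE = re.compile(r"\w+")


def extract_types(text):
    """Return a Counter mapping each lowercased word type to its frequency."""
    return Counter(tok.lower() for tok in TOKEN_RE.findall(text))


def normalize(text, variants):
    """Replace tokens listed in `variants`; leave all other text unchanged.

    Lookup is case-insensitive, and the replacement is inserted as given in
    the dictionary, so the original capitalization of a replaced token is
    not preserved.
    """
    def repl(match):
        tok = match.group(0)
        return variants.get(tok.lower(), tok)

    return TOKEN_RE.sub(repl, text)


sample = "Un huomo et una donna; et l'huomo vuole havere nuove."
types = extract_types(sample)
normalized = normalize(sample, VARIANTS)

# Types that still lack a dictionary entry, i.e. the remaining manual work.
unmapped = sorted(t for t in types if t not in VARIANTS)

print(types.most_common(3))
print(normalized)
print(unmapped)
```

With a few hundred types, filling such a dictionary by hand is manageable;
the list of unmapped types is exactly what grows out of reach at the scale
of our corpus.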

Best wishes,

Gabor

