        Date: 2022-05-05 16:49:15+00:00
        From: Gabor Toth <>
        Subject: Working with not normalized historical texts

Dear Colleagues,

We work with a corpus of transcribed Italian handwritten news-sheets
from the early modern period. We would like to apply tools of corpus
linguistics and text analytics to study the transcriptions (~ 1 million

However, the transcriptions were not normalized, and they feature a lot of
orthographic variation.

I am writing to ask about current best practices for working with
non-normalized historical corpora. I am interested in how other projects
tackle the problem of normalization if the transcription phase is already

For instance, when working with smaller datasets I extracted all types and
then manually built a dictionary of variants and normalized word

But given that we now have approximately 50,000 unique types, this does
not seem feasible.
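
To make the dictionary-based approach concrete, here is a minimal Python sketch of what I mean, not our actual pipeline. The function names, the sample sentence and the variant entries are hypothetical and chosen only for illustration. The tokenization is deliberately naive: it treats any run of letters as a token, so elided forms with apostrophes would be split.

```python
import re
from collections import Counter

# Naive tokenizer (an assumption): any run of Unicode letters is one token.
WORD = re.compile(r"[^\W\d_]+")

def extract_types(text):
    """Return a Counter mapping each lower-cased word type to its frequency."""
    return Counter(WORD.findall(text.lower()))

def normalize(text, variants):
    """Replace each token listed in the variant dictionary with its normalized form."""
    def replace(match):
        token = match.group(0)
        # Lookup is case-insensitive; tokens not in the dictionary are left unchanged.
        return variants.get(token.lower(), token)
    return WORD.sub(replace, text)

# Hypothetical sample text and manually built variant dictionary.
sample = "Li avisi di Venetia et li avvisi di Venezia"
variants = {"avisi": "avvisi", "venetia": "Venezia", "et": "e"}

types = extract_types(sample)
normalized = normalize(sample, variants)
print(types.most_common(3))
print(normalized)
```

Sorting the types by frequency, as `most_common` does, might at least help to decide which variants to handle first, but the dictionary itself still has to be built by hand.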

Best wishes,


