Humanist Discussion Group

Humanist Archives: May 6, 2022, 5:33 a.m. Humanist 35.688 - working with unnormalised historical texts?

              Humanist Discussion Group, Vol. 35, No. 688.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne

        Date: 2022-05-05 16:49:15+00:00
        From: Gabor Toth
        Subject: Working with not normalized historical texts

Dear Colleagues,

We work with a corpus of transcribed Italian handwritten news-sheets
from the early modern period. We would like to apply tools of corpus
linguistics and text analytics to study the transcriptions (~ 1 million

However, the transcriptions were not normalized, and they feature a great
deal of orthographic variation.

I am writing to ask about current best practices for working with
non-normalized historical corpora. I am interested in how other projects
tackle the problem of normalization if the transcription phase is already

For instance, when working with smaller datasets, I extracted all types and
then manually built a dictionary with variants and normalized word

But given that we now have approximately 50,000 unique types, this does
not seem feasible.
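
To make the dictionary-based approach described above concrete, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions, not the project's actual code: the function names
(extract_types, normalize), the sample sentence, and the variant entries
are all hypothetical, and the tokenizer simply treats any run of letters as
a word.

```python
import re
from collections import Counter

# Any run of Unicode letters counts as a word (an assumption; apostrophes
# and hyphens therefore split words).
WORD = re.compile(r"[^\W\d_]+")


def extract_types(text):
    """Return a Counter mapping each lowercased word type to its frequency."""
    return Counter(token.lower() for token in WORD.findall(text))


def normalize(text, variants):
    """Replace every word listed in `variants` with its normalized form.

    Words without an entry are left untouched, so a partial dictionary
    is safe to apply.
    """
    def replace(match):
        word = match.group(0)
        return variants.get(word.lower(), word)

    return WORD.sub(replace, text)


# Hypothetical sample sentence and hand-built variant dictionary.
sample = "Gli huomini et le donne hanno havuto aviso"
variants = {
    "huomini": "uomini",
    "et": "e",
    "havuto": "avuto",
    "aviso": "avviso",
}

types = extract_types(sample)
print(len(types))                   # 8 distinct types
print(normalize(sample, variants))  # Gli uomini e le donne hanno avuto avviso
```

Calling types.most_common() on the result lists the types by descending
frequency, which is one way to see which entries a hand-built dictionary
would need first.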

Best wishes,

