Humanist Archives: May 6, 2022, 5:33 a.m. Humanist 35.688 - working with unnormalised historical texts?
Humanist Discussion Group, Vol. 35, No. 688.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
Submit to: email@example.com
Date: 2022-05-05 16:49:15+00:00
From: Gabor Toth <firstname.lastname@example.org>
Subject: Working with not normalized historical texts
We work with a corpus of transcribed Italian handwritten news-sheets
from the early modern period. We would like to apply tools of corpus
linguistics and text analytics to study the transcriptions (~ 1 million
However, the transcriptions were not normalized and they feature a lot of
I am writing to ask about current best practices for working with
non-normalized historical corpora. I am interested in how other projects
tackle the problem of normalization if the transcription phase is already
For instance, when working with smaller datasets I extracted all types and
then manually built a dictionary with variants and normalized word
But given that we now have approximately 50,000 unique types, this does
not seem feasible.
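
To make the dictionary approach described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code used in the project: the function names, the simple regex tokenizer, and the three entries in VARIANTS are all hypothetical placeholders. It counts types across the transcriptions, applies a hand-built variant dictionary, and lists the types that have no entry yet, most frequent first.

```python
import re
from collections import Counter

# Hypothetical variant dictionary: the entries are illustrative only,
# not taken from the corpus discussed in this post.
VARIANTS = {
    "et": "e",
    "havere": "avere",
    "huomo": "uomo",
}

# Assumption: a plain word-character tokenizer is good enough for a sketch.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def extract_types(texts):
    """Count every distinct token (type) across all transcriptions."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def normalize(text, variants):
    """Replace each token that has a dictionary entry by its normalized form."""
    return [variants.get(token, token) for token in tokenize(text)]


def unmapped_types(counts, variants):
    """List (type, frequency) pairs with no dictionary entry, most frequent first."""
    known = set(variants) | set(variants.values())
    return [(t, n) for t, n in counts.most_common() if t not in known]


if __name__ == "__main__":
    sample = ["Et l'huomo deve havere"]
    counts = extract_types(sample)
    print(normalize(sample[0], VARIANTS))
    print(unmapped_types(counts, VARIANTS))
```

Sorting the unmapped types by frequency shows where manual effort would pay off first, but with roughly 50,000 types the manual step remains the bottleneck.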