Humanist Discussion Group

Humanist Archives: May 9, 2022, 6:44 a.m. Humanist 36.2 - working with unnormalised historical texts

              Humanist Discussion Group, Vol. 36, No. 2.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne

        Date: 2022-05-08 19:55:51+00:00
        From: Nick Thieberger
        Subject: Re: [Humanist] 35.688: working with unnormalised historical texts?

Hi Gabor,

We have a similar issue in dealing with manuscripts of Australian
Indigenous languages collected in the past and written in various systems
that are often internally inconsistent.

We use a soundex search mechanism to look for known variations, based on a
sample set of sound correspondences (e.g. retroflex n written as rn, *n*,
nr; velar nasal written as ng, gn, kn; etc.). You can either generate the
forms in the documents themselves as alternate forms (stored in TEI tags)
or create a search that looks for any of the possible forms in the original.
See here:  the fuzzy search option.
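
As a rough illustration of the second option (searching for any of the
possible forms), here is a minimal sketch in Python. It is not the code
behind our search; the correspondence classes are just the two examples
mentioned above, and the names CLASSES, sound_key and fuzzy_search are
made up for this example. Each known spelling of a sound is collapsed to
one class symbol, and two words match when their collapsed keys are equal.

```python
import re

# Hypothetical correspondence classes, taken only from the examples above.
# A real project would derive these from its own sample of correspondences.
CLASSES = {
    "N": ["rn", "nr"],        # retroflex n
    "G": ["ng", "gn", "kn"],  # velar nasal
}

# Map every variant spelling to its class symbol.
_VARIANT_TO_KEY = {v: k for k, variants in CLASSES.items() for v in variants}

# Try longer variants first so that they win over shorter ones.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, _VARIANT_TO_KEY), key=len, reverse=True))
)


def sound_key(word):
    """Collapse each known variant spelling to its class symbol."""
    # Lowercasing keeps the uppercase class symbols distinct from the text.
    return _PATTERN.sub(lambda m: _VARIANT_TO_KEY[m.group(0)], word.lower())


def fuzzy_search(query, tokens):
    """Return the tokens whose sound key equals that of the query."""
    key = sound_key(query)
    return [t for t in tokens if sound_key(t) == key]
```

With these classes, a query such as "wanga" would also find tokens spelled
"wagna" or "wakna", but not "wana". The same keys could be stored alongside
each token in the documents if you prefer the first option.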

As for the corpus itself, could you use the fuzzy search to take a word
from the source text, do a dictionary lookup in standard Italian that would
then provide word for word correspondences between the non-standard line of
text and a new line populated by that lookup?
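
A minimal sketch of that idea, assuming you have a list of standard Italian
word forms to look up against. It uses generic string similarity from
Python's standard difflib module as a stand-in for a sound-based search; the
function name align_line, the cutoff value and the tiny lexicon in the usage
note are all hypothetical.

```python
import difflib


def align_line(line, lexicon, cutoff=0.75):
    """Pair each token of a line with its closest lexicon entry.

    Returns a list of (original, normalised) tuples. The second item is
    None when no lexicon entry is similar enough, so that those tokens
    can be set aside for manual checking.
    """
    pairs = []
    for token in line.split():
        # get_close_matches returns at most n entries, best match first.
        matches = difflib.get_close_matches(
            token.lower(), lexicon, n=1, cutoff=cutoff
        )
        pairs.append((token, matches[0] if matches else None))
    return pairs
```

For example, with the lexicon ["avendo", "notizia", "della", "guerra"], the
line "havendo notitia della guerra" would be paired word for word with
"avendo notizia della guerra". On 50,000 types you would run this once per
type rather than once per token, and the cutoff would need tuning against a
hand-checked sample.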

All the best,


Assoc.Prof. Nick Thieberger FAHA
School of Languages and Linguistics
The University of Melbourne
Parkville, VIC 3010, Australia
+61 3 8344 8952

Director, Pacific and Regional Archive for Digital Sources in Endangered
Cultures (PARADISEC)
Deputy Director, Research Unit for Indigenous Language
CI in the ARC Centre of Excellence for the Dynamics of Language
Lead CI in Nyingarn, a platform for primary sources in Australian languages

On Fri, 6 May 2022 at 14:33, Humanist wrote:

>               Humanist Discussion Group, Vol. 35, No. 688.
>         Department of Digital Humanities, University of Cologne
>                       Hosted by DH-Cologne
>                 Submit to:
>         Date: 2022-05-05 16:49:15+00:00
>         From: Gabor Toth <>
>         Subject: Working with not normalized historical texts
> Dear Colleagues,
> We work with a corpus of - transcribed - Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~ 1 million
> words).
> However, the transcriptions were not normalized and they feature a lot of
> orthographical variations.
> I am writing to ask about current best practices of working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization if the transcription phase is already
> complete.
> For instance, when working with smaller datasets I extracted all types and
> then built - manually - a dictionary with variants and normalized word
> forms.
> But given that now we have approximately 50.000 - unique - types, this does
> not seem to be feasible.
> Best wishes,
> Gabor
