Humanist Archives: May 9, 2022, 6:44 a.m. Humanist 36.2 - working with unnormalised historical texts
Humanist Discussion Group, Vol. 36, No. 2.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
Submit to: email@example.com
Date: 2022-05-08 19:55:51+00:00
From: Nick Thieberger <firstname.lastname@example.org>
Subject: Re: [Humanist] 35.688: working with unnormalised historical texts?
We have a similar issue in dealing with manuscripts of Australian
Indigenous languages collected in the past and written in various systems
that are often internally inconsistent.
We use a Soundex-style search mechanism to look for known variations, based on a
sample set of sound correspondences (e.g. retroflex n written as rn, *n*,
nr; velar nasal written as ng, gn, kn; etc.). You can either generate the
forms in the documents themselves as alternate forms (stored in TEI tags)
or create a search that looks for any of the possible forms in the original.
See the fuzzy search option here: http://bates.org.au/search/
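To illustrate the second option (a search that matches any of the possible spellings), here is a minimal Python sketch. It is not the code behind the Bates search. The correspondence table holds only the two examples mentioned above, the names `CORRESPONDENCES`, `fuzzy_key` and `fuzzy_search` are hypothetical, and the test strings are invented rather than taken from any manuscript. The idea is to collapse every known spelling of a sound to one class symbol, so that two words match whenever their collapsed keys are equal.

```python
import re

# Assumed sample of sound correspondences: class symbol -> known spellings.
# Only the two examples from the message are listed; a real table would be longer.
CORRESPONDENCES = {
    "N": ["rn", "nr"],        # retroflex n
    "G": ["ng", "gn", "kn"],  # velar nasal
}

# Reverse lookup: each spelling points to its class symbol.
_VARIANT_TO_CLASS = {
    variant: symbol
    for symbol, variants in CORRESPONDENCES.items()
    for variant in variants
}

# Try longer spellings first so they win over shorter overlapping ones.
_PATTERN = re.compile(
    "|".join(sorted(map(re.escape, _VARIANT_TO_CLASS), key=len, reverse=True))
)


def fuzzy_key(word):
    """Lowercase the word, then replace each known spelling by its class symbol."""
    return _PATTERN.sub(lambda m: _VARIANT_TO_CLASS[m.group(0)], word.lower())


def fuzzy_search(query, words):
    """Return the words whose key equals the key of the query."""
    key = fuzzy_key(query)
    return [w for w in words if fuzzy_key(w) == key]


print(fuzzy_search("wangka", ["wagnka", "waknka", "wanka", "Wangka"]))
# → ['wagnka', 'waknka', 'Wangka']
```

The same keys could instead be stored alongside each word in the documents, which corresponds to the first option of generating the alternate forms in advance.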
As for the corpus itself, could you use the fuzzy search to take a word
from the source text, do a dictionary lookup in standard Italian that would
then provide word for word correspondences between the non-standard line of
text and a new line populated by that lookup?
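As a rough sketch of that suggestion, the Python below looks up each token of a source line in a list of standard forms and returns word-for-word pairs. It makes several assumptions: the lexicon is a tiny invented sample rather than a real Italian dictionary, the function name `normalise_line` and the 0.8 cutoff are hypothetical, and the standard library's `difflib.get_close_matches` stands in for whatever fuzzy matcher a project would actually use.

```python
import difflib


def normalise_line(line, lexicon, cutoff=0.8):
    """Pair each token of `line` with its closest lexicon entry.

    A token with no entry at or above `cutoff` is paired with None, so
    unresolved words stay visible for manual review.
    """
    pairs = []
    for token in line.split():
        matches = difflib.get_close_matches(
            token.lower(), lexicon, n=1, cutoff=cutoff
        )
        pairs.append((token, matches[0] if matches else None))
    return pairs


# Invented sample lexicon of standard forms (assumption, not a real resource).
lexicon = ["uomo", "avere", "lettera"]

for source, standard in normalise_line("huomo havere xyz", lexicon):
    print(source, "->", standard)
```

Generic string similarity of this kind knows nothing about period-specific spelling habits, so a correspondence table of the sort described above would probably still be needed for systematic variants.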
All the best,
Assoc.Prof. Nick Thieberger FAHA
School of Languages and Linguistics
The University of Melbourne
Parkville, VIC 3010, Australia
+61 3 8344 8952
Director, Pacific and Regional Archive for Digital Sources in Endangered
Cultures (PARADISEC) <http://paradisec.org.au/>
Deputy Director Research Unit for Indigenous Language
CI in the ARC Centre of Excellence for the Dynamics of Language
Lead CI in Nyingarn, a platform for primary sources in Australian languages
On Fri, 6 May 2022 at 14:33, Humanist <email@example.com> wrote:
> Humanist Discussion Group, Vol. 35, No. 688.
> Department of Digital Humanities, University of Cologne
> Hosted by DH-Cologne
> Submit to: firstname.lastname@example.org
> Date: 2022-05-05 16:49:15+00:00
> From: Gabor Toth <email@example.com>
> Subject: Working with not normalized historical texts
> Dear Colleagues,
> We work with a corpus of - transcribed - Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~ 1 million
> However, the transcriptions were not normalized and they feature a lot of
> orthographical variations.
> I am writing to ask about current best practices of working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization if the transcription phase is already
> For instance, when working with smaller datasets I extracted all types and
> then built - manually - a dictionary with variants and normalized word
> But given that now we have approximately 50.000 - unique - types, this does
> not seem to be feasible.
> Best wishes,
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: firstname.lastname@example.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php