Humanist Discussion Group

              Humanist Discussion Group, Vol. 36, No. 2.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2022-05-08 19:55:51+00:00
        From: Nick Thieberger <thien@unimelb.edu.au>
        Subject: Re: [Humanist] 35.688: working with unnormalised historical texts?

Hi Gabor,

We have a similar issue in dealing with manuscripts of Australian
Indigenous languages collected in the past and written in various systems
that are often internally inconsistent.

We use a Soundex-style search mechanism to look for known variations, based on a
sample set of sound correspondences (e.g. retroflex n written as rn, *n*, or
nr; velar nasal written as ng, gn, or kn; etc.). You can either generate the
alternate forms and store them in the documents themselves (in TEI tags),
or create a search that looks for any of the possible forms in the original
documents.

See the fuzzy search option here: http://bates.org.au/search/
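
To make the second option concrete, here is a minimal sketch in Python. It is not the code behind the search page above. It collapses every known spelling of a sound onto one class symbol and then compares words by the resulting key. The two classes come from the examples in this message (leaving out the *n* spelling); the class symbols, the function names, and the test words are made up for illustration.

```python
import re

# Illustrative subset of sound correspondences: each class symbol maps to
# the spellings that should be treated as equivalent. The symbols "N" and
# "G" are arbitrary labels chosen for this sketch.
CORRESPONDENCES = {
    "N": ["rn", "nr"],        # retroflex n
    "G": ["ng", "gn", "kn"],  # velar nasal
}

# Reverse lookup: spelling -> class symbol.
_VARIANT_TO_CLASS = {
    variant: symbol
    for symbol, variants in CORRESPONDENCES.items()
    for variant in variants
}

# One alternation over all spellings, longest first, scanned left to right.
_VARIANT_RE = re.compile(
    "|".join(sorted(map(re.escape, _VARIANT_TO_CLASS), key=len, reverse=True))
)


def fuzzy_key(word):
    """Lowercase the word and replace each known spelling by its class symbol."""
    return _VARIANT_RE.sub(lambda m: _VARIANT_TO_CLASS[m.group(0)], word.lower())


def fuzzy_search(query, tokens):
    """Return the tokens whose key equals the key of the query."""
    key = fuzzy_key(query)
    return [token for token in tokens if fuzzy_key(token) == key]


# Made-up strings, not taken from any manuscript.
print(fuzzy_search("wangka", ["wagnka", "waknka", "wanka", "Wangka"]))
```

The same keys could be computed once per token and stored next to the original form, which is roughly the first option (alternate forms kept in the document itself).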

As for the corpus itself, could you use a fuzzy search to take each word
from the source text and look it up in a standard Italian dictionary? That
would give word-for-word correspondences between the non-standard line of
text and a new line populated by that lookup.
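
A rough sketch of what I mean, in Python. It uses the standard library's difflib.get_close_matches as a stand-in for the fuzzy search, which is a generic string-similarity measure and not the sound-correspondence approach described above. The tiny lexicon, the cutoff of 0.8, and the function name are assumptions for illustration only; a real run would need a full Italian word list and some tuning.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a standard Italian word list.
LEXICON = ["uomo", "onore", "avere", "lettera"]


def normalise_line(tokens, lexicon, cutoff=0.8):
    """Pair each source token with its closest lexicon entry.

    Returns a list of (original, normalised) tuples. The second element
    is None when no lexicon entry reaches the similarity cutoff, so that
    unmatched words can be reviewed by hand.
    """
    aligned = []
    for token in tokens:
        matches = get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
        aligned.append((token, matches[0] if matches else None))
    return aligned


# Source tokens on the left, lookup result on the right.
for original, normalised in normalise_line(["huomo", "honore", "xyz"], LEXICON):
    print(original, "->", normalised)
```

Keeping the unmatched tokens as None means the manual dictionary only has to cover what the lookup cannot resolve, which should be far fewer than all of the types.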

All the best,

Nick

***********************
Assoc.Prof. Nick Thieberger FAHA
School of Languages and Linguistics
The University of Melbourne
Parkville, VIC 3010, Australia
+61 3 8344 8952
<http://languages-linguistics.unimelb.edu.au/thieberger>
nthieberger.net <http://nthieberger.net/>

Director, Pacific and Regional Archive for Digital Sources in Endangered
Cultures (PARADISEC) <http://paradisec.org.au/>
Deputy Director Research Unit for Indigenous Language
<https://arts.unimelb.edu.au/research-unit-for-indigenous-language>
CI in the ARC Centre of Excellence for the Dynamics of Language
<http://www.dynamicsoflanguage.edu.au/>
Lead CI in Nyingarn, a platform for primary sources in Australian languages
<https://nyingarn.net>


On Fri, 6 May 2022 at 14:33, Humanist <humanist@dhhumanist.org> wrote:

>
>               Humanist Discussion Group, Vol. 35, No. 688.
>         Department of Digital Humanities, University of Cologne
>                       Hosted by DH-Cologne
>                        http://www.dhhumanist.org
>                 Submit to: humanist@dhhumanist.org
>
>
>
>
>         Date: 2022-05-05 16:49:15+00:00
>         From: Gabor Toth <gabor.toth@maximilianeum.de>
>         Subject: Working with not normalized historical texts
>
> Dear Colleagues,
>
> We work with a corpus of - transcribed - Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~ 1 million
> words).
>
> However, the transcriptions were not normalized and they feature a lot of
> orthographical variations.
>
> I am writing to ask about current best practices of working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization if the transcription phase is already
> complete.
>
> For instance, when working with smaller datasets I extracted all types and
> then built - manually - a dictionary with variants and normalized word
> forms.
>
> But given that now we have approximately 50.000 - unique - types, this does
> not seem to be feasible.
>
> Best wishes,
>
> Gabor



_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php