Humanist Discussion Group

Humanist Archives: May 15, 2022, 8:19 a.m. Humanist 36.11 - working with unnormalised historical texts

              Humanist Discussion Group, Vol. 36, No. 11.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne

        Date: 2022-05-14 14:50:35+00:00
        From: Crystal Hall <>
        Subject: Re: [Humanist] 36.2: working with unnormalised historical texts

Dear Gabor,

Please pardon the slow response during the last week of our academic term. I
have taken a hybrid manual-automated approach to solving this problem for my
work on Galileo’s library (currently ~340,000 word types), and I am curious to
know what you and others might think of this method. I have made a model for
annotating words in early modern texts (rather than a model of early modern
Italian language) because so many of the texts that I study actively resist
regular syntax as part of how they make meaning, and follow loose (if any)
conventions of punctuation, capitalization, and orthography. As you
pointed out, the usual features that would help to infer parts of speech with
any natural language processing are inconsistent, to say the least.

I began with the entries in the 1612 Vocabolario della Crusca (for the non-
Italianists who are still reading, this billed itself as the first authoritative
Italian dictionary). I hand-coded the entries for part of speech using a
simplified Penn Treebank tagset, marked known irregulars for hand processing,
and built out the other possible variations algorithmically. Importantly, the
types that can have different parts of speech have an aggregated tag so that,
for example, there is no inference about what role “ci” plays in a sentence – it
always carries its multiple possible parts of speech in the tag. Each generated
word type is paired with its original entry to assist with lemmatization. The
process creates forms that were never used in the hopes of creating most of the
forms that were. The result was a first draft of over 3 million forms (beyond
estimates of spoken and written terms in modern Italian) that tagged up to 77%
of the word types in the 82 book-length documents in my growing corpus. Given
that 20-90% of any one of those texts is made up of types that occur nowhere
else in the corpus, this felt like a good start, certainly better than the
results I had with the NLP models of Italian that I was able to find.
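The core of this approach, as I understand it, can be sketched roughly as follows. The entries, the toy expansion rule, and the tags here are invented for illustration (my actual rules and data are far more extensive); the point is the aggregated tag that carries every possible part of speech, and the pairing of each generated form with its source entry for lemmatization.

```python
# Hypothetical sketch of the aggregated-tag lexicon described above.
# Entries, tags, and the expansion rule are illustrative only.

from collections import defaultdict

# Hand-coded dictionary entries: headword -> set of simplified
# Penn-Treebank-style tags (ambiguous entries keep every tag).
entries = {
    "ci": {"PRP", "RB"},   # pronoun or adverb; never disambiguated
    "amare": {"VB"},       # verb headword
}

def expand(headword):
    """Toy first-conjugation rule generating possible inflected forms,
    including forms that may never have been used in practice."""
    if headword.endswith("are"):
        stem = headword[:-3]
        return [stem + suf for suf in ("o", "i", "a", "iamo", "ate", "ano")]
    return []

lexicon = defaultdict(lambda: {"tags": set(), "lemmas": set()})
for headword, tags in entries.items():
    for form in [headword] + expand(headword):
        lexicon[form]["tags"] |= tags          # aggregated tag, no inference
        lexicon[form]["lemmas"].add(headword)  # link back for lemmatization

print(sorted(lexicon["ci"]["tags"]))     # ['PRP', 'RB']
print(sorted(lexicon["amo"]["lemmas"]))  # ['amare']
```

Because "ci" always carries both tags, no sentence-level inference is ever attempted; downstream analysis works with the full set of possibilities.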

It has been most interesting to see where this approach is completely
insufficient: lyric poetry, translations, and ephemera such as orations, mostly
by authors who tried unsuccessfully to find princely patrons. The combination of
this model and my particular corpus captures some of the extent of linguistic
gatekeeping in the period. As expected, the model works best on the academic,
court-based literature (the milieu of the dictionary creators and their
sources). My plan is to expand it incrementally by adding more types from the
documents, starting with those with the lowest percentage of matches. Unlike
other processes, growing this model will not change existing tags unless I find
an error.
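The incremental expansion I describe amounts to ranking documents by how much of their type inventory the lexicon already covers and working upward from the bottom. A minimal sketch, with invented document names and a stand-in lexicon:

```python
# Illustrative sketch of ranking documents by lexicon coverage so that
# the lowest-coverage texts are expanded first (data is invented).

lexicon = {"amo", "ami", "ama", "ci"}   # stand-in for the full lexicon

corpus = {
    "dialogo": ["amo", "ci", "stelle"],
    "orazione": ["lume", "stelle", "virtu"],
}

def coverage(tokens):
    """Fraction of a document's word types found in the lexicon."""
    types = set(tokens)
    return len(types & lexicon) / len(types)

# Documents with the fewest matches come first in the expansion queue.
queue = sorted(corpus, key=lambda doc: coverage(corpus[doc]))
print(queue)  # ['orazione', 'dialogo']
```

Since new types are only ever added, and existing entries change only when an error is found, earlier tagging results remain stable as the model grows.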

It seems that our needs might be complementary and that we might be able to
assist one another in iteratively building something more comprehensive. Or
perhaps you have already learned of a better method? Please feel free to be in
touch directly via email.

Best wishes,

Crystal Hall (she/her)
Director, Digital and Computational Studies
Affiliated faculty, Italian Studies
Bowdoin College
9300 College Station
Brunswick, ME 04011



>         Date: 2022-05-05 16:49:15+00:00
>         From: Gabor Toth <>
>         Subject: Working with not normalized historical texts
> Dear Colleagues,
> We work with a corpus of - transcribed - Italian handwritten news-sheets
> from the early modern period. We would like to apply tools of corpus
> linguistics and text analytics to study the transcriptions (~ 1 million
> words).
> However, the transcriptions were not normalized and they feature a lot of
> orthographical variations.
> I am writing to ask about current best practices of working with
> non-normalized historical corpora. I am interested in how other projects
> tackle the problem of normalization if the transcription phase is already
> complete.
> For instance, when working with smaller datasets I extracted all types and
> then built - manually - a dictionary with variants and normalized word
> forms.
> But given that now we have approximately 50.000 - unique - types, this does
> not seem to be feasible.
> Best wishes,
> Gabor
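The manual variant dictionary Gabor describes can be sketched as a simple lookup table from attested spellings to normalized forms; the entries below are invented examples, and at 50,000 unique types the table itself would of course need to be built or bootstrapped rather than typed by hand.

```python
# Minimal sketch of a variant-to-normalized-form dictionary of the
# kind described above (the spellings are invented examples).

variants = {
    "homini": "uomini",
    "huomini": "uomini",
    "et": "e",
}

def normalize(tokens):
    """Replace each token with its normalized form, if one is known."""
    return [variants.get(t, t) for t in tokens]

print(normalize(["et", "li", "huomini"]))  # ['e', 'li', 'uomini']
```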
