Humanist Discussion Group

Humanist Archives: May 26, 2022, 7:05 a.m. Humanist 36.31 - working with unnormalised historical texts

				
              Humanist Discussion Group, Vol. 36, No. 31.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2022-05-25 08:20:12+00:00
        From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
        Subject: Re: [Humanist] 36.19: working with unnormalised historical texts

Dear Gabor et al,

We looked into approaches to this question in a general way in a
workshop at the BBAW in Berlin in 2019, and I wrote a short blog post
[1] reporting on it. Unfortunately the planned hackathon to follow up on
the event was a victim of the pandemic, but could be revived if there is
sufficient interest. The main outcome of the workshop was a guide to
tools for normalisation applications, in the form of a 'CLARIN Resource
Family' [2]. This and other 'Resource Families' are subject to ongoing
curation, and suggestions for updates and corrections are very welcome.

Best wishes,
Martin

[1] https://sprache.hypotheses.org/1790
[2] https://www.clarin.eu/resource-families/tools-normalisation

On 18/05/2022 07:49, Humanist wrote:
>                Humanist Discussion Group, Vol. 36, No. 19.
>
>          Date: 2022-05-17 13:02:30+00:00
>          From: Crystal Hall <chall@bowdoin.edu>
>          Subject: Re: [Humanist] 36.17: working with unnormalised historical texts
>
> Dear Gabor,
>
> Thank you for asking about these points. Since I have just been using this
> personally, clarifications will hopefully make it more useful eventually as an
> open data set. I always wonder if my years of studying Galileo have made me too
> open to unorthodox approaches, so conversation is a helpful guide.
>
> I hesitated to call the result types because they are really just potential word
> forms. To me, a type is a pattern of characters that is actually found in a
> text. Most of the potential word forms were generated algorithmically using the
> rules of Italian grammar. For example, I wrote code to conjugate the verbs in
> the original data in 7 simple tenses and moods plus creating the gerund,
> participle, and present progressive. The hand processing in your first question
> refers to excepting known irregular verbs from the algorithmic process for
> regular verbs. I created variations of the overall process to account for
> orthography (e.g. gn for ng) and certain families of verbs (e.g. fare, dire,
> -rre, etc.). I also automatically created abbreviated forms (e.g. that drop the
> terminal -o) and added a series of pronoun combinations to the infinitives,
> gerunds, and participles following conventions at the time. Galileo was
> notorious for this practice. Each of the 5,000+ root verbs exists in my data in
> over 200 different forms that could be types, but some would never exist in a
> text because they defy usage rules. There is currently some bloat in how the
> participle variations are created, which I would tidy if I were to publish the
> data. Your email inquiry arrived just as I was starting to work with the draft
> results. As a literary scholar trying to understand patterns of expression, my
> priority was to tag/annotate the types in my texts (creating models of specific
> instances of language) rather than create a comprehensive linguistic model of
> the period (which feels more like the realm of linguists).
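
The generation step described above might be sketched roughly as follows. This is not the code used for the data set discussed in this thread; it is a minimal illustration under several assumptions: it covers only the present indicative, gerund, and past participle of regular -are/-ere/-ire verbs (not all seven tenses and moods), the list of irregular verbs is a tiny placeholder, and every name in it (candidate_forms, IRREGULAR, ENCLITICS, and so on) is hypothetical. It also ignores spelling adjustments such as -care/-gare or stems ending in -i, so it will over-generate in the same spirit as the "potential word forms" mentioned above.

```python
# Hypothetical ending tables: regular verbs, present indicative only.
PRESENT_ENDINGS = {
    "are": ["o", "i", "a", "iamo", "ate", "ano"],
    "ere": ["o", "i", "e", "iamo", "ete", "ono"],
    "ire": ["o", "i", "e", "iamo", "ite", "ono"],
}
GERUND_ENDING = {"are": "ando", "ere": "endo", "ire": "endo"}
PARTICIPLE_ENDING = {"are": "ato", "ere": "uto", "ire": "ito"}

# Pronouns that can attach to infinitives and gerunds (assumed subset).
ENCLITICS = ["mi", "ti", "si", "ci", "vi", "lo", "la", "li", "le", "ne"]

# Placeholder list: irregular verbs are excluded and left for hand processing.
IRREGULAR = {"fare", "dire", "essere", "avere", "andare"}


def candidate_forms(infinitive):
    """Return a set of potential word forms for one regular verb."""
    if infinitive in IRREGULAR:
        return set()  # handled by hand, not by rule
    stem, ending = infinitive[:-3], infinitive[-3:]
    if ending not in PRESENT_ENDINGS:
        return set()  # e.g. -rre verbs need their own variation of the process

    gerund = stem + GERUND_ENDING[ending]
    participle = stem + PARTICIPLE_ENDING[ending]
    finite = {stem + e for e in PRESENT_ENDINGS[ending]}

    forms = {infinitive, gerund, participle} | finite

    # Abbreviated forms: drop the terminal -o of plural forms in -mo / -no
    # (assumption: only these endings are truncated in this sketch).
    forms |= {f[:-1] for f in finite if f.endswith(("mo", "no"))}

    # Pronoun combinations: the infinitive loses its final -e before a pronoun.
    for pronoun in ENCLITICS:
        forms.add(infinitive[:-1] + pronoun)
        forms.add(gerund + pronoun)
    return forms


if __name__ == "__main__":
    for form in sorted(candidate_forms("parlare")):
        print(form)
```

Calling candidate_forms("parlare") yields forms such as parlano, parlan, parliam, parlarmi, and parlandosi, while candidate_forms("fare") returns an empty set because the verb is on the irregular list.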
>
> Further examination of the types in the documents that are not covered by this
> data will improve the workflow and tagging rate over time.
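
One way to examine the types that a list of potential forms does not cover is sketched below. It assumes the forms are available as a plain Python set and that a simple letter-based tokeniser is adequate; the function name coverage and the sample sentence are hypothetical, and the "tagging rate" here is simply the share of distinct types found in the list.

```python
import re
from collections import Counter


def coverage(text, known_forms):
    """Return (share of types covered, uncovered types with their counts)."""
    # Runs of Unicode letters, lowercased; digits and punctuation are skipped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    type_counts = Counter(tokens)
    uncovered = Counter(
        {t: n for t, n in type_counts.items() if t not in known_forms}
    )
    covered = len(type_counts) - len(uncovered)
    rate = covered / len(type_counts) if type_counts else 0.0
    # Most frequent uncovered types first: candidates for new rules or entries.
    return rate, uncovered.most_common()


if __name__ == "__main__":
    known = {"parlan", "parlano"}
    rate, missing = coverage("Parlan di cose che parlano", known)
    print(rate, missing)
```

Sorting the uncovered types by frequency puts the gaps that would raise the tagging rate most at the top of the list.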
>
> Best wishes,
> Crystal
>
> From: Humanist <humanist@dhhumanist.org>
> Date: Tuesday, May 17, 2022 at 1:18 AM
> To: Crystal Hall <chall@bowdoin.edu>
> Subject: [Humanist] 36.17: working with unnormalised historical texts
>
>                Humanist Discussion Group, Vol. 36, No. 17.
>
>          Date: 2022-05-16 13:46:31+00:00
>          From: Gabor Toth <gabor.toth@maximilianeum.de>
>          Subject: Re: [Humanist] 36.12: working with unnormalised historical texts
>
> Dear Crystal,
>
> Many thanks for your detailed answer; congratulations on working out all
> this, which sounds great.
>
> Could you please explain what you mean by the following two points:
>
> 1. "I hand coded the entries for part of speech using a
> simplified Penn Tree Bank system, marked known irregulars for hand
> processing, and built out the other possible variations algorithmically."
>
> I understand the hand coding part but I am not sure about the following
> steps.
>
> 2. "The result was a first draft of over 3 million forms"
>
> That sounds like a very big number; by forms do you mean types?
>
> Cheers,
>
> Gabor



--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php