Humanist Discussion Group

Humanist Archives: June 8, 2023, 6:25 a.m. Humanist 37.83 - literary corpora without publishers' data

				
              Humanist Discussion Group, Vol. 37, No. 83.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Wust, Markus <markus.wust@uni-tuebingen.de>
           Subject: AW: [Humanist] 37.79: literary corpora without publishers' data? (61)

    [2]    From: maurizio lana <maurizio.lana@uniupo.it>
           Subject: Re: [Humanist] 37.79: literary corpora without publishers' data? (24)


--[1]------------------------------------------------------------------------
        Date: 2023-06-07 09:18:08+00:00
        From: Wust, Markus <markus.wust@uni-tuebingen.de>
        Subject: AW: [Humanist] 37.79: literary corpora without publishers' data?

Dear Claire,

You could check with your library to see if some of their licensed collections
came with data mining agreements. If so, there might be ways of accessing the
materials without having to go through the provider's online search interface.

Best,
Markus Wust
University of Tübingen

-----Original Message-----
From: Humanist <humanist@dhhumanist.org>
Sent: Wednesday, 7 June 2023 06:44
To: Wust, Markus <markus.wust@uni-tuebingen.de>
Subject: [Humanist] 37.79: literary corpora without publishers' data?


              Humanist Discussion Group, Vol. 37, No. 79.

        Date: 2023-06-06 10:54:44+00:00
        From: WARWICK, CLAIRE L. <c.l.h.warwick@durham.ac.uk>
        Subject: literary data dump

Dear everyone,

Please forgive me if this is an extremely dumb question, but I’m wondering
how to get hold of a lot of literary text without the kind of front end that
publishers tend to attach to it. The context is this: I have students who want
to work on data science approaches to both English and Latin poetry, e.g. they
need English translations of the whole corpus of Latin poetry, and a corpus of
all English poetry (or even all literature in English) from 1500-1700, so that
they can train and run language models.

Of course, my university has access to things like ProQuest’s Literature Online,
but that comes with its own search interface, and would mean having to download
each poem one by one, which would be incredibly time-consuming and probably not
allowed anyway. Obviously, there is also the Oxford Text Archive (OTA), but I
don’t know how comprehensive its holdings are. I also need accurate texts, so I
don’t want to use things like Project Gutenberg, whose quality I am not certain of.

I’d be most grateful for any suggestions.

Best wishes,

Claire



--------
Claire Warwick MA, MPhil, PhD
Professor of Digital Humanities
Co-Director Durham Institute of Data Science Department of English Studies
Durham University www.durham.ac.uk/staff/c-l-h-warwick/


--[2]------------------------------------------------------------------------
        Date: 2023-06-07 08:55:44+00:00
        From: maurizio lana <maurizio.lana@uniupo.it>
        Subject: Re: [Humanist] 37.79: literary corpora without publishers' data?

Hi Claire,
the Latin texts on the PHI CD-ROM (!) are a good starting point for the archaic
and golden periods, that is, all the 'usually' studied Latin authors. The
texts are in Beta Code, but you should easily find tools to decode them.

We also have a lot of late Latin texts in the digilibLT digital library,
which are available encoded in TEI-XML but also as simple txt files.
You can get all of them as a single zip file by going to
https://digiliblt.uniupo.it/g_bulk_opere.php
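
Once such a bulk zip has been downloaded by hand, the plain-text files can be
pulled into a corpus with a few lines of Python. What follows is only a minimal
sketch: it assumes that the archive contains files ending in .txt and that they
are UTF-8 encoded, neither of which is confirmed here. The function names and
the file names in the demo archive are hypothetical and are built in memory, so
that the example runs without any download.

```python
import io
import zipfile


def read_txt_members(zip_source, encoding="utf-8"):
    """Return {member name: text} for every .txt file inside a zip archive.

    zip_source may be a path to a downloaded zip or a file-like object.
    The encoding is an assumption; undecodable bytes are replaced, not fatal.
    """
    texts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            # Skip TEI-XML files and anything else that is not plain text.
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                texts[name] = raw.decode(encoding, errors="replace")
    return texts


def build_demo_zip():
    """Build a tiny in-memory zip (hypothetical contents) for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("opera_a.txt", "arma virumque cano")
        archive.writestr("opera_a.xml", "<TEI/>")
        archive.writestr("sub/opera_b.txt", "gallia est omnis divisa")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    corpus = read_txt_members(build_demo_zip())
    for name, text in sorted(corpus.items()):
        # Print each text's name with a crude whitespace token count.
        print(name, len(text.split()))
```

For a real download, the path of the saved zip file would be passed to
read_txt_members in place of the demo archive.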
Best,
Maurizio


apriti cielo
sulla frontiera
sulla rotta nera
una vita intera
Mannarino, apriti cielo

------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php