Humanist Discussion Group

Humanist Archives: June 9, 2023, 5:31 a.m. Humanist 37.86 - literary corpora without publishers' data

              Humanist Discussion Group, Vol. 37, No. 86.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
           Subject: Re: [Humanist] 37.79: literary corpora without publishers' data? (78)

    [2]    From:  <jkrybicki@gmail.com>
           Subject: RE: [Humanist] 37.79: literary corpora without publishers' data? (18)


--[1]------------------------------------------------------------------------
        Date: 2023-06-08 12:43:29+00:00
        From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
        Subject: Re: [Humanist] 37.79: literary corpora without publishers' data?

Dear Claire,

As you say, the OTA Collections, now available from
https://llds.ling-phil.ox.ac.uk/, contain a lot of relevant texts,
including all of the EEBO-TCP texts in the public domain (60,238
documents) in TEI XML format, available for download with CC licences.
All of the texts have subject classifications, so if you can decide
which ones are relevant to your definition of literature, you could in
principle navigate to each text and download it. If you'd like to get in
touch, I could provide all of the subject classifications so that you (or
your students) could select the ones that you want and then I could make
a subcorpus for you to download.
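
In Python, the select-by-subject step described above might be sketched as
follows. This is only an illustration under stated assumptions: that the TEI
files use the standard TEI P5 namespace, and that subject classifications are
recorded as <term> elements inside <keywords> in the TEI header. The function
names and the sample document are hypothetical and are not part of any OTA
tool.

```python
import xml.etree.ElementTree as ET

# Assumption: the files use the standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_subjects(xml_string):
    """Return the header's keyword terms, lower-cased and whitespace-normalised."""
    root = ET.fromstring(xml_string)
    terms = root.findall(".//tei:teiHeader//tei:keywords//tei:term", TEI_NS)
    return [" ".join((t.text or "").split()).lower() for t in terms]


def tei_plain_text(xml_string):
    """Return the content of the first <text> element with all markup removed."""
    root = ET.fromstring(xml_string)
    text_el = root.find(".//tei:text", TEI_NS)
    if text_el is None:
        return ""
    # Join without separators so inline tags do not split words,
    # then collapse runs of whitespace into single spaces.
    return " ".join("".join(text_el.itertext()).split())


def select_texts(documents, wanted):
    """Keep documents with a subject term containing any wanted phrase.

    documents: mapping of file name -> TEI XML string (hypothetical input shape)
    wanted:    phrases to look for in the subject terms, case-insensitive
    """
    wanted = [w.lower() for w in wanted]
    selected = {}
    for name, xml_string in documents.items():
        subjects = tei_subjects(xml_string)
        if any(w in s for s in subjects for w in wanted):
            selected[name] = tei_plain_text(xml_string)
    return selected


# Made-up sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><textClass><keywords>
    <term>English poetry -- Early modern, 1500-1700.</term>
  </keywords></textClass></profileDesc></teiHeader>
  <text><body><l>First line,</l> <l>second line.</l></body></text>
</TEI>"""

if __name__ == "__main__":
    print(select_texts({"sample.xml": SAMPLE}, ["english poetry"]))
```

Matching on substrings of the subject terms is a deliberately loose choice;
a real selection would probably work from a reviewed list of classifications.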

As for Latin poetry, and translations into English, Perseus Digital
Library (http://www.perseus.tufts.edu/hopper/) might have what you want.

Martin Wynne

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
https://orcid.org/0000-0002-4155-0530



On 07/06/2023 05:44, Humanist wrote:
>                Humanist Discussion Group, Vol. 37, No. 79.
>
>          Date: 2023-06-06 10:54:44+00:00
>          From: WARWICK, CLAIRE L. <c.l.h.warwick@durham.ac.uk>
>          Subject: literary data dump
>
> Dear everyone,
>
> Please forgive me if this is an extremely dumb question, but I’m wondering
> about how to get hold of a lot of literary text without the kind of front end
> that publishers tend to attach to it. The context is this: I have students who
> want to work on data science approaches to both English and Latin poetry, eg
> they need English translations of the whole corpus of Latin poetry, and a
> corpus of all English poetry (or even all literature in English) from
> 1500-1700, so that they can train and run language models.
>
> Of course, my university has access to things like Proquest’s Literature
> Online, but that comes with its own search interface, and would mean having to
> download each poem one by one, which would be incredibly time consuming and
> probably not allowed anyway. Obviously, there is also the OTA, but I don’t know
> how comprehensive its holdings are. I also need accurate texts so don’t want
> to use things like Project Gutenberg, whose quality I am not certain of.
>
> I’d be most grateful for any suggestions.
>
> Best wishes,
>
> Claire
>
>
>
> --------
> Claire Warwick MA, MPhil, PhD
> Professor of Digital Humanities
> Co-Director Durham Institute of Data Science
> Department of English Studies
> Durham University
> www.durham.ac.uk/staff/c-l-h-warwick/

--[2]------------------------------------------------------------------------
        Date: 2023-06-08 10:04:58+00:00
        From:  <jkrybicki@gmail.com>
        Subject: RE: [Humanist] 37.79: literary corpora without publishers' data?

Dear Claire,

As someone who is now putting together (and cleaning) a collection of 10
thousand electronic literary texts in Polish (both originally written in and
translated into), I am deeply jealous of people working with Eng Lit. I now have
a very good opinion of Project Gutenberg, especially since there is a clever R
package, gutenbergr (God bless Johnson&Robinson), making it very easy to
download the texts (and get rid of the legalese). I'm not sure how good it is
for poetry, but there's always good old textcreationpartnership.org (God bless
Jonathan Hope et al.).
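
For anyone working in Python rather than R, the "get rid of the legalese" step
can be sketched roughly as below. This is not how gutenbergr does it; it is a
minimal illustration that assumes the plain-text file marks the start and end
of the work with lines of the form "*** START OF THE PROJECT GUTENBERG EBOOK
... ***" and "*** END OF THE PROJECT GUTENBERG EBOOK ... ***". Older files use
other wordings, so the patterns, the function name and the sample text are all
assumptions.

```python
import re

# Assumed marker lines; real files vary, especially older ones.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the start and end markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(raw)
    begin = start.end() if start else 0
    end = END_RE.search(raw, begin)
    finish = end.start() if end else len(raw)
    return raw[begin:finish].strip()


# Made-up sample, for illustration only.
SAMPLE = (
    "The Project Gutenberg eBook of Poems\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK POEMS ***\n\n"
    "A line of verse,\nand another.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK POEMS ***\n"
    "Licence text follows here.\n"
)

if __name__ == "__main__":
    print(strip_gutenberg_boilerplate(SAMPLE))
```

Texts cleaned this way still need checking by hand, since front matter such as
transcriber's notes sits inside the markers.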

Which brings me to my pet peeve: why oh why oh why do so few of the existing
digital editions we DHers are (rightly) proud of make it easy (or at all
possible) to download the entire material in plain text format in one fell
swoop?

Best,
Jan


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php