Humanist Discussion Group, Vol. 37, No. 86.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org


[1] From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
    Subject: Re: [Humanist] 37.79: literary corpora without publishers' data? (78)

[2] From: <jkrybicki@gmail.com>
    Subject: RE: [Humanist] 37.79: literary corpora without publishers' data? (18)


--[1]------------------------------------------------------------------------
Date: 2023-06-08 12:43:29+00:00
From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
Subject: Re: [Humanist] 37.79: literary corpora without publishers' data?

Dear Claire,

As you say, the OTA Collections, now available from https://llds.ling-phil.ox.ac.uk/, contain a lot of relevant texts, including all of the EEBO-TCP texts in the public domain (60,238 documents) in TEI XML format, available for download with CC licences.

All of the texts have subject classifications, so if you can decide which ones are relevant to your definition of literature, you could in principle navigate to each text and download it. If you'd like to get in touch, I could provide all of the subject classification so that you (or your students) could select the ones that you want and then I could make a subcorpus for you to download.

As for Latin poetry, and translations into English, Perseus Digital Library (http://www.perseus.tufts.edu/hopper/) might have what you want.

Martin Wynne

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
https://orcid.org/0000-0002-4155-0530

On 07/06/2023 05:44, Humanist wrote:
> Humanist Discussion Group, Vol. 37, No. 79.
> Department of Digital Humanities, University of Cologne
> Hosted by DH-Cologne
> www.dhhumanist.org
> Submit to: humanist@dhhumanist.org
>
>
>
>
> Date: 2023-06-06 10:54:44+00:00
> From: WARWICK, CLAIRE L.
<c.l.h.warwick@durham.ac.uk>
> Subject: literary data dump
>
> Dear everyone,
>
> Please forgive me if this is an extremely dumb question, but I’m wondering about
> how to get hold of a lot of literary text without the kind of front end that
> publishers tend to attach to it. The context is this: I have students who want
> to work on data science approaches to both English and Latin poetry, eg they
> need English translations of the whole corpus of Latin poetry, and a corpus of
> all English poetry (or even all literature in English) from 1500-1700, so that
> they can train and run language models.
>
> Of course, my university has access to things like Proquest’s Literature Online,
> but that comes with its own search interface, and would mean having to download
> each poem one by one, which would be incredibly time consuming and probably not
> allowed anyway. Obviously, there is also the OTA, but I don’t know how
> comprehensive its holdings are. I also need accurate texts so don’t want to use
> things like Project Gutenberg, whose quality I am not certain of.
>
> I’d be most grateful for any suggestions.
>
> Best wishes,
>
> Claire
>
>
>
> --------
> Claire Warwick MA, MPhil, PhD
> Professor of Digital Humanities
> Co-Director Durham Institute of Data Science
> Department of English Studies
> Durham University
> www.durham.ac.uk/staff/c-l-h-warwick/


--[2]------------------------------------------------------------------------
Date: 2023-06-08 10:04:58+00:00
From: <jkrybicki@gmail.com>
Subject: RE: [Humanist] 37.79: literary corpora without publishers' data?

Dear Claire,

As someone who is now putting together (and cleaning) a collection of 10 thousand electronic literary texts in Polish (both originally written in and translated into), I am deeply jealous of people working with Eng Lit.
I now have a very good opinion of Project Gutenberg, especially since there is a clever R package, gutenbergr (God bless Johnson & Robinson), which makes it very easy to download the texts (and get rid of the legalese). I'm not sure how good it is for poetry, but there's always good old textcreationpartnership.org (God bless Jonathan Hope et al.).

Which brings me to my pet peeve: why oh why oh why do so few of the existing digital editions we DHers are (rightly) proud of make it easy (or at all possible) to download the entire material in plain text format in one fell swoop?

Best,
Jan


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php