Humanist Discussion Group, Vol. 37, No. 91.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org


    [1]    From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
           Subject: Re: [Humanist] 37.86: literary corpora without publishers' data (51)

    [2]    From: maurizio lana <maurizio.lana@uniupo.it>
           Subject: Re: [Humanist] 37.86: literary corpora without publishers' data (34)


--[1]------------------------------------------------------------------------
        Date: 2023-06-09 08:58:18+00:00
        From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
        Subject: Re: [Humanist] 37.86: literary corpora without publishers' data

Dear Jan et al,

As a curator of a large collection of digital editions in the OTA Collections (https://llds.ling-phil.ox.ac.uk/), I understand your frustration.

It is reasonably straightforward to download all TCP texts. You can find the metadata here:

https://github.com/textcreationpartnership/Texts/blob/master/TCP.csv

Based on the metadata, you can select all the texts or a subcorpus (as Claire requires), and then you'd need to write a simple script to download the XML texts.

As for 'plain text', we haven't found a format that any two users are happy with. Do you want to throw away all the metadata? All the structural tagging? The text identifiers? How do you want to deal with non-alphabetic material, or text in different languages? We leave it up to the researcher to make these decisions when stripping out the text from the XML. But I'd be happy to discuss providing a common format if there is enough demand.

Best wishes,
Martin

> Dear Claire,
>
> As someone who is now putting together (and cleaning) a collection of 10
> thousand electronic literary texts in Polish (both originally written in and
> translated into), I am deeply jealous of people working with Eng Lit.
> I now have a very good opinion of Project Gutenberg, especially since there
> is a clever R package, gutenbergr (God bless Johnson&Robinson), making it
> very easy to download the texts (and get rid of the legalese). I'm not sure
> how good it is for poetry, but there's always good old
> textcreationpartnership.org (God bless Jonathan Hope et al.).
>
> Which brings me to my pet peeve: why oh why oh why do so few of the existing
> digital editions we DHers are (rightly) proud of make it easy (or at all
> possible) to download the entire material in plain text format in one fell
> swoop?
>
> Best,
> Jan

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
https://orcid.org/0000-0002-4155-0530

--[2]------------------------------------------------------------------------
        Date: 2023-06-09 08:44:08+00:00
        From: maurizio lana <maurizio.lana@uniupo.it>
        Subject: Re: [Humanist] 37.86: literary corpora without publishers' data

Jan,
in digilibLT you have a menu choice allowing, as you would like,

> to download the entire material in plain text format in one fell swoop

but you can get more: in our digital library (https://digiliblt.uniupo.it/g_bulk_opere.php) you have the choice, for the whole package, to download the txt format, the epub, the pdf, or the encoded TEI-XML. every text is accompanied by full publisher's data and the licence is CC BY-NC-SA.

the idea is that txt is for those who want to treat texts as (textual) data, epub for those wanting to read them as ebooks, pdf for those wanting to have a printed version, and TEI-XML for those who prefer to work with the rich representation of the text allowed by the TEI encoding. the txt, pdf, and epub formats are automatically produced from the main TEI-XML exemplar.

ciao!
Maurizio

------------------------------------------------------------------------
At this point I must make a confession: like my friend Erri De Luca, I am an extremist Europeanist. This means that, for me, a united Europe is the only reasonable political utopia that we Europeans have devised.
Xavier Cercas, opening of the Salone del Libro, Turin 2018
------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php