Humanist Discussion Group

Humanist Archives: June 10, 2023, 8:01 a.m. Humanist 37.91 - literary corpora without publishers' data

				
              Humanist Discussion Group, Vol. 37, No. 91.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org


    [1]    From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
           Subject: Re: [Humanist] 37.86: literary corpora without publishers' data (51)

    [2]    From: maurizio lana <maurizio.lana@uniupo.it>
           Subject: Re: [Humanist] 37.86: literary corpora without publishers' data (34)


--[1]------------------------------------------------------------------------
        Date: 2023-06-09 08:58:18+00:00
        From: Martin Wynne <martin.wynne@ling-phil.ox.ac.uk>
        Subject: Re: [Humanist] 37.86: literary corpora without publishers' data

Dear Jan et al,

As a curator of a large collection of digital editions in the OTA
Collections (https://llds.ling-phil.ox.ac.uk/), I understand your
frustration. It is reasonably straightforward to download all TCP texts.

You can find the metadata here:

https://github.com/textcreationpartnership/Texts/blob/master/TCP.csv

Based on the metadata, you can select all, or a subcorpus (as Claire
requires) and you'd need to write a simple script to download the XML
texts.
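
A minimal sketch of such a script follows. It rests on assumptions that
should be checked against the actual files: that TCP.csv has a header row
with columns named "TCP" (the text identifier) and "Date", and that each
text sits in its own GitHub repository named after its identifier. The raw
CSV address is simply derived from the link above; the function names are
hypothetical.

```python
import csv
import io
import urllib.request

# Raw-file form of the TCP.csv link given above.
RAW_CSV = ("https://raw.githubusercontent.com/"
           "textcreationpartnership/Texts/master/TCP.csv")


def select_ids(csv_text, start_year, end_year, id_col="TCP", date_col="Date"):
    """Return ids whose date field begins with a year in [start_year, end_year].

    Assumption: the column names "TCP" and "Date" match the real header row.
    Rows whose date does not begin with digits are skipped.
    """
    ids = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = (row.get(date_col) or "").strip()[:4]
        if year.isdecimal() and start_year <= int(year) <= end_year:
            ids.append(row[id_col].strip())
    return ids


def xml_url(tcp_id):
    # Assumption: one repository per text, holding a file called <id>.xml.
    return ("https://raw.githubusercontent.com/"
            f"textcreationpartnership/{tcp_id}/master/{tcp_id}.xml")


def download(ids, out_dir="."):
    """Fetch each selected XML file into out_dir (performs network requests)."""
    for tcp_id in ids:
        urllib.request.urlretrieve(xml_url(tcp_id), f"{out_dir}/{tcp_id}.xml")
```

To use it, fetch the CSV once (for example with
urllib.request.urlopen(RAW_CSV).read().decode("utf-8")), pass the text to
select_ids() with the date range you want, and hand the resulting list to
download().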

As for 'plain text', we haven't found a format that any two users are
happy with. Do you want to throw away all the metadata? All the
structural tagging? The text identifiers? How do you want to deal with
non-alphabetic material, text in different languages? We leave it up to
the researcher to make these decisions when stripping out the text from
the XML. But I'd be happy to discuss providing a common format if there
is enough demand.
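
To illustrate the kind of decisions involved, here is a rough sketch of one
possible way to strip text out of a TEI-XML file using only the Python
standard library. It is not a proposed common format: the function names
are hypothetical, and the choice of elements to skip (here the header and
notes) is exactly the sort of decision each researcher would need to make.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def plain_text(xml_string, skip=("teiHeader", "note")):
    """Concatenate text content, omitting elements whose local name is in skip."""
    parts = []

    def walk(elem):
        if local(elem.tag) not in skip:
            if elem.text:
                parts.append(elem.text)
            for child in elem:
                walk(child)
        # Text following an element belongs to its parent, so it is kept
        # even when the element itself is skipped.
        if elem.tail:
            parts.append(elem.tail)

    walk(ET.fromstring(xml_string))
    # Collapse all runs of whitespace, including line breaks, to single spaces.
    return " ".join("".join(parts).split())
```

Adding "foreign" to skip would drop passages marked as being in another
language; passing skip=() would keep everything, including the metadata in
the header.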

Best wishes,
Martin

> Dear Claire,
>
> As someone who is now putting together (and cleaning) a collection of 10
> thousand electronic literary texts in Polish (both originally written in and
> translated into), I am deeply jealous of people working with Eng Lit. I now
> have a very good opinion of Project Gutenberg, especially since there is a
> clever R
> package, gutenbergr (God bless Johnson&Robinson), making it very easy to
> download the texts (and get rid of the legalese). I'm not sure how good it is
> for poetry, but there's always good old textcreationpartnership.org (God bless
> Jonathan Hope et al.).
>
> Which brings me to my pet peeve: why oh why oh why do so few of the existing
> digital editions we DHers are (rightly) proud of make it easy (or possible
> at all) to download the entire material in plain text format in one fell
> swoop?
>
> Best,
> Jan

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
https://orcid.org/0000-0002-4155-0530

--[2]------------------------------------------------------------------------
        Date: 2023-06-09 08:44:08+00:00
        From: maurizio lana <maurizio.lana@uniupo.it>
        Subject: Re: [Humanist] 37.86: literary corpora without publishers' data

Jan, digilibLT has a menu option that allows you, as you would like,
> to download the entire material in plain text format in one fell swoop
But you can get more: in our digital library
(https://digiliblt.uniupo.it/g_bulk_opere.php) you can choose, for
the whole package, to download the txt format, the epub, the pdf, or the
encoded TEI-XML.
Every text is accompanied by full publisher's data, and the licence is CC
BY-NC-SA.

The idea is that
txt is for those who want to treat texts as (textual) data;
epub is for those wanting to read them as ebooks;
pdf is for those wanting to have a printed version;
TEI-XML is for those who prefer to work with the rich representation of the
text allowed by the TEI encoding.

The txt, pdf, and epub formats are automatically produced from the main
TEI-XML exemplar.
ciao!
Maurizio

------------------------------------------------------------------------

at this point I must make a confession:
like my friend Erri De Luca, I am an extremist Europeanist.
This means that, for me, a united Europe is the only reasonable political
utopia that we Europeans have coined.
Xavier Cercas, opening of the Salone del Libro, Turin 2018

------------------------------------------------------------------------
Maurizio Lana
Università del Piemonte Orientale
Dipartimento di Studi Umanistici
Piazza Roma 36 - 13100 Vercelli


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php