Humanist Discussion Group

Humanist Archives: Nov. 20, 2021, 9:18 a.m. Humanist 35.365 - crowdsourcing OCR correction

				                  Humanist Discussion Group, Vol. 35, No. 365.
        Department of Digital Humanities, University of Cologne
                   		Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2021-11-19 14:34:55+00:00
        From: Ben Brumfield 
        Subject: crowdsourced proofing for a Latin dictionary

Dear Maurizio (and list),

Let me recommend that you start by researching "crowdsourced OCR
correction" rather than "proofing", as that is the more commonly used term
among the crowdsourcing community and is likely to yield more results.

To your questions:

> the Latin-Italian section of the dictionary will be OCRed and there
> will be the need for proofing the resulting text.
> so i ask for your help about
> 1) previous existing similar experiences of crowdsourced proofing of
> digitized texts

The three oldest, most successful crowdsourced OCR correction projects were
Project Gutenberg's Distributed Proofreaders project, the National Library
of Australia's newspaper correction project (on Trove), and Wikisource (a
Wikipedia spin-off).  The first two are not open to hosting external
projects like yours, but their experience--particularly that of the NLA--is
worth studying.

For a great introduction to the landscape of crowdsourcing in DH, I
recommend Melissa Terras's "Crowdsourcing in the Digital Humanities"
(https://hcommons.org/deposits/download/hc:15066/CONTENT/mterras_crowdsourcing20in20digital20humanities_final1.pdf/).

A much more comprehensive (and recent) guide is the Collective Wisdom
Handbook: Perspectives on Crowdsourcing in Cultural Heritage
(https://britishlibrary.pubpub.org/), by Ridge, Blickhan, Ferriter et al.

Crowdsourcing OCR correction can be successful, but in our experience it
is very different from crowdsourced manuscript transcription or indexing.
There are technical challenges to ingesting the raw OCR text into a
platform and associating it with images
(https://content.fromthepage.com/crowdsourcing-ocr-correction/).
There are also challenges for volunteer motivation and editorial
workflow (https://content.fromthepage.com/ocr-correction-vs-transcription/)
in OCR correction projects.
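
As a rough illustration of the text-to-image association step, here is a
minimal Python sketch.  It assumes (hypothetically) that the OCR engine
writes one plain-text file per page and that each text file shares its
base name with the page image, e.g. page_0001.jpg and page_0001.txt.  The
function name pair_pages and the file-naming scheme are illustrative
only, not any platform's actual import format.

```python
from pathlib import Path

# Assumed image extensions; adjust to whatever the scanning workflow produces.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def pair_pages(names):
    """Pair each page image with the OCR text file that shares its base name.

    Returns (pairs, unmatched): pairs is a list of (image_name, text_name)
    sorted by base name; unmatched lists files that have no partner.
    """
    images, texts = {}, {}
    for name in names:
        path = Path(name)
        suffix = path.suffix.lower()
        if suffix in IMAGE_SUFFIXES:
            images[path.stem] = name
        elif suffix == ".txt":
            texts[path.stem] = name

    shared = sorted(images.keys() & texts.keys())
    pairs = [(images[stem], texts[stem]) for stem in shared]

    # Files without a partner usually signal a failed OCR run or a missing scan.
    unmatched = sorted(
        [images[stem] for stem in images.keys() - texts.keys()]
        + [texts[stem] for stem in texts.keys() - images.keys()]
    )
    return pairs, unmatched
```

Checking the unmatched list before uploading anything is a cheap way to
catch pages that would otherwise appear in the platform with no text to
correct, or text with no image to check it against.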


> 2) how to create the environment for the crowdsourced proofing and
> how to select the collaborators (they must know Latin!)
>

Technically speaking, you should be able to use one of the existing
web-based software platforms for OCR correction, like Wikisource, Madoc,
PyBossa, or my own FromThePage.  (Transkribus may also be an option, but I
have not followed their development closely enough to be certain.)  One
important question is whether to preserve bounding boxes of
words/lines/entries through the correction process--a difficult requirement
to satisfy, and one that is unlikely to be needed for a TEI-based edition.
Another question to ask up front is how the corrected text will be
extracted from the platform; some platforms provide exports in TEI-XML
already, while others only produce CSVs containing JSON objects--requiring
a programmer to make the corrected data usable.
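
To give a sense of what that programming work looks like, here is a
minimal Python sketch that pulls corrected text out of a CSV export whose
cells contain JSON objects.  The column names (page_id, annotations) and
the JSON key (text) are assumptions made up for this example; every
platform uses its own layout, so check the actual export before adapting
it.

```python
import csv
import io
import json


def extract_corrected_text(csv_text, id_column="page_id",
                           json_column="annotations", text_key="text"):
    """Return a list of (page_id, corrected_text) tuples from a CSV export.

    Each row is assumed to hold a page identifier in `id_column` and a
    JSON object in `json_column` whose `text_key` entry is the corrected
    transcription.  All three names are hypothetical defaults.
    """
    pages = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        payload = json.loads(row[json_column])
        pages.append((row[id_column], payload[text_key]))
    return pages


if __name__ == "__main__":
    # Build a tiny sample export so that the CSV quoting is guaranteed valid.
    sample = io.StringIO()
    writer = csv.writer(sample)
    writer.writerow(["page_id", "annotations"])
    writer.writerow(["p1", json.dumps({"text": "abacus, i, m."})])
    for page_id, text in extract_corrected_text(sample.getvalue()):
        print(page_id, text)
```

From a list like this it is a short step to writing one text file per
page or wrapping each page in TEI elements, but that step depends on the
encoding decisions of the edition.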

Some of the platforms already have communities of volunteers who may be
interested in working on the dictionary.  However, these need to be Italian
speakers, since it is very difficult to recruit volunteers across a
language barrier.  For that reason, you might start with the Italian
Wikisource community (https://it.wikisource.org/wiki/Pagina_principale),
which is already doing OCR correction on Italian-language documents.  (If
you want to start a Wikisource project I recommend contacting someone from
GLAMWiki, https://en.wikipedia.org/wiki/Wikipedia:GLAM, to help you
interact with the community.)

Even if your platform already has a community, you will still want to
recruit your own collaborators to the project, since they may be more
likely to have specialist knowledge (of Latin), as well as being more
likely to help out with later work like TEI mark-up.  Classical
philologists are important to reach out to, but they are unlikely to
provide the bulk of the labor.  Your most dedicated users are likely to be
laymen with some background in Latin study: retirees, reenactors,
enthusiasts for the Latin mass, classical music fans, adults who studied
classics in university but are employed in unrelated industries--all people
who enjoy Latin as a part of meaningful leisure.  We provide a guide to
volunteer outreach (http://bitly.com/fromthepage_volunteers) written by a
volunteer as well as semi-monthly webinars
(https://content.fromthepage.com/webinars/) on the "soft skills" of
crowdsourcing projects, but many other crowdsourcing practitioners would be
happy to talk about project ideas.  (See the CROWDSOURCING mailing list,
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CROWDSOURCING, for more.)

Paid crowdsourcing platforms are also an option, but they are very rare in
DH or cultural heritage, and the incentives do not reward high quality.  (I
remember being shocked by the poor quality of the results Sofia Ares
Oliveira presented at DH2018 in "Comparing human and machine performances
in transcribing 18th century handwritten Venetian script" -- they used a
commercial service like Crowdflower, which I suspect was the main culprit.)

> 3) any formal/legal issues that could arise

Since the original dictionary was published in 1978, it is likely still
under copyright.  That means that community-based platforms like Wikisource
will be unwilling to host the project unless you are able to provide proof
that you have permission to copy it.  Private hosts like Veridian or
FromThePage are likely to take your word that you have permission and let
you bear any legal risk.  And if you install OCR correction software on
your own servers, of course you need not ask anyone's permission.

I will be very interested in following your project, and suspect that my
colleagues in the community will as well, so I do hope you'll keep in touch.

Best of luck,

Ben


--
Ben W. Brumfield
Partner. Brumfield Labs LLC
Creators of FromThePage (https://fromthepage.com/)


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php