Humanist Discussion Group, Vol. 35, No. 365.
Department of Digital Humanities, University of Cologne
Hosted by DH-Cologne
www.dhhumanist.org
Submit to: humanist@dhhumanist.org

Date: 2021-11-19 14:34:55+00:00
From: Ben Brumfield <benwbrum@gmail.com>
Subject: crowdsourced proofing for a Latin dictionary

Dear Maurizio (and list),

Let me recommend that you start by researching "crowdsourced OCR correction" rather than "proofing", as that is the more commonly used term among the crowdsourcing community and is likely to yield more results.

To your questions:

> the Latin-Italian section of the dictionary will be OCRed and there
> will be the need for proofing the resulting text.
> so i ask for your help about
> 1) previous existing similar experiences of crowdsourced proofing of
> digitized texts

The three oldest, most successful crowdsourced OCR correction projects were Project Gutenberg's Distributed Proofreaders project, the National Library of Australia's newspaper correction project (on Trove), and Wikisource (a Wikipedia spin-off). The first two are not open to hosting external projects like yours, but their experience--particularly that of the NLA--is worth studying.

For a great introduction to the landscape of crowdsourcing in DH, I recommend Melissa Terras's "Crowdsourcing in the Digital Humanities" (https://hcommons.org/deposits/download/hc:15066/CONTENT/mterras_crowdsourcing20in20digital20humanities_final1.pdf/). A much more comprehensive (and recent) guide is the Collective Wisdom Handbook: Perspectives on Crowdsourcing in Cultural Heritage (https://britishlibrary.pubpub.org/), by Ridge, Blickhan, Ferriter et al.

Crowdsourcing OCR correction can be successful, but in our experience it is very different from crowdsourced manuscript transcription or indexing. There are technical challenges to ingesting the raw OCR text into a platform and associating it with images (https://content.fromthepage.com/crowdsourcing-ocr-correction/).
There are also challenges for volunteer motivation and editorial workflow (https://content.fromthepage.com/ocr-correction-vs-transcription/) in OCR correction projects.

> 2) how to create the environment for the crowdsourced proofing and
> how to select the collaborators (they must know Latin!)

Technically speaking, you should be able to use one of the existing web-based software platforms for OCR correction, like Wikisource, Madoc, PyBossa, or my own FromThePage. (Transkribus may also be an option, but I have not followed their development closely enough to be certain.) One important question is whether to preserve bounding boxes of words/lines/entries through the correction process--a difficult requirement to satisfy, and one that is unlikely to be needed for a TEI-based edition. Another question to ask up front is how the corrected text will be extracted from the platform; some platforms provide exports in TEI-XML already, while others only produce CSVs containing JSON objects--requiring a programmer to make the corrected data usable.

Some of the platforms already have communities of volunteers who may be interested in working on the dictionary. However, these need to be Italian speakers, since it is very difficult to recruit volunteers across a language barrier. For that reason, you might start with the Italian Wikisource community (https://it.wikisource.org/wiki/Pagina_principale), which is already doing OCR correction on Italian-language documents. (If you want to start a Wikisource project I recommend contacting someone from GLAMWiki, https://en.wikipedia.org/wiki/Wikipedia:GLAM, to help you interact with the community.)

Even if your platform already has a community, you will still want to recruit your own collaborators to the project, since they may be more likely to have specialist knowledge (of Latin), as well as being more likely to help out with later work like TEI mark-up.
Classical philologists are important to reach out to, but they are unlikely to provide the bulk of the labor. Your most dedicated users are likely to be laymen with some background in Latin study: retirees, reenactors, enthusiasts for the Latin mass, classical music fans, adults who studied classics in university but are employed in unrelated industries--all people who enjoy Latin as a part of meaningful leisure.

We provide a guide to volunteer outreach (http://bitly.com/fromthepage_volunteers) written by a volunteer, as well as semi-monthly webinars (https://content.fromthepage.com/webinars/) on the "soft skills" of crowdsourcing projects, but many other crowdsourcing practitioners would be happy to talk about project ideas. (See the CROWDSOURCING mailing list, https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CROWDSOURCING, for more.)

Paid crowdsourcing platforms are also an option, but they are very rare in DH or cultural heritage, and the incentives do not reward high quality. (I remember being shocked by the poor quality that Sofia Ares Oliveira reported at DH2018 in "Comparing human and machine performances in transcribing 18th century handwritten Venetian script" -- they used a commercial service like Crowdflower, which I suspect was the main culprit.)

> 3) any formal/legal issues that could arise

Since the original dictionary was published in 1978, it is likely still under copyright. That means that community-based platforms like Wikisource will be unwilling to host the project unless you are able to provide proof that you have permission to copy it. Private hosts like Veridian or FromThePage are likely to take your word that you have permission and let you bear any legal risk. And if you install OCR correction software on your own servers, of course you need not ask anyone's permission.

I will be very interested in following your project, and suspect that my colleagues in the community will as well, so I do hope you'll keep in touch.

Best of luck,

Ben

--
Ben W. Brumfield
Partner, Brumfield Labs LLC
Creators of FromThePage (https://fromthepage.com/)