3.1259 scanning and encoding texts (204)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Wed, 4 Apr 90 21:15:44 EDT

Humanist Discussion Group, Vol. 3, No. 1259. Wednesday, 4 Apr 1990.

(1) Date: 4 April 1990 10:12:56 CDT (85 lines)
From: "Michael Sperberg-McQueen 312 996-2477 -2981" <U35395@UICVM>
Subject: costs of scanning

(2) Date: Wed, 4 Apr 90 13:00:43 EDT (36 lines)
From: elli@harvunxw.BITNET (Elli Mylonas)
Subject: Scanning texts

(3) Date: Wed, 4 Apr 90 14:54:16 EDT (23 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1253 e-texts, stemmatology, scanning (107)

(4) Date: Wed, 4 Apr 90 14:59:07 EDT (17 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1253 e-texts, stemmatology, scanning (107)

(5) Date: Wed, 4 Apr 90 14:09:24 CDT (31 lines)
From: Michael S. Hart
Subject: Reply to Ken Steele

(1) --------------------------------------------------------------------
Date: 4 April 1990 10:12:56 CDT
From: "Michael Sperberg-McQueen 312 996-2477 -2981" <U35395@UICVM>
Subject: costs of scanning

In his recent notes Michael S. Hart once more uses the slogan of 'ASCII
only' text to confuse the issues of device independence and absence of
markup. Perhaps it's time to clarify some basics.

First of all, markup is not necessarily non-ASCII (many markup schemes
use only ASCII characters), and ASCII-only text is not necessarily
markup-free. Second, ASCII itself is not wholly device-independent
(because many devices use other character sets, either non-U.S.
standards or vendor-specific character sets). Readers of this list will
recall extensive discussions in the past about network-safe character
sets, the upshot of which was that only a subset of ASCII is generally
assured of arriving legible at other nodes on the net. Nor is
restriction to a specific set of characters by itself enough to ensure
that texts are independent of specific devices or software. And finally
the abolition of markup is not possible without lobotomizing our texts.

Mr. Hart observes that to be reusable and survive long, machine-readable
texts need to be portable to many machines. Right. From this he infers
that they ought also to be markup-free. Wrong. To be useful and live
long, texts should be device-independent, but to be useful they must not
be markup-free.

The claim that texts can and should be represented without markup
because markup is costly and subjective misses three boats.

1 No text is entirely free of markup in the broad sense, with the
possible exception of some older Greek and Hebrew manuscripts written in
scriptio continua.

2 No clear boundary can be drawn between the "facts" of a text and our
interpretation. Word boundaries are interpretive if the source is in
scriptio continua, vowels are interpretive if the source is unpointed
Hebrew or Arabic, verse boundaries are interpretive if the source is
written run-on (as many medieval verse manuscripts are). The
significance or insignificance of line breaks in a printed text is
inherently an interpretive decision (which may have important
text-critical implications in, say, the First Folio). All these
interpretations can be expressed in ASCII-only texts without any SGML or
similar markup.

3 The restriction of markup to that provided directly by ASCII --
effectively a restriction to procedural markup readily performed by a
teletype machine of the late 1950s (tabs, carriage returns, line feeds,
backspaces, and the occasional bell) -- represents a misguided and
inadequate theory of texts which in effect claims that the only
important thing about a text is its sequence of graphemes.

*** No electronic tool working with such a text can possibly know
anything interesting about it. ***

There is no representation of chapter divisions or sentences, and so no
reliable searching for words co-occurring within those contexts. There
is no indication of structure within (say) a dictionary entry, so one
cannot search for 'French' or 'F.' used within definitions or quotations
as opposed to within etymologies. Nor can one search reliably for the
date '1791', because it must be manually disambiguated from a reference
to 'p. 1791' or other numbers. (Dictionaries are really an appallingly
bad example for any proponent of markup-free texts to choose.) ASCII has
no representation for dialect or language shifts and thus no ability to
distinguish English 'the' from older French or German 'The'. And of
course there is no real way to represent French or German texts in ASCII
because ASCII has no method of representing diacritics.
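
The '1791' and 'F.' cases can be made concrete with a minimal sketch.
It assumes a made-up dictionary entry and invented element names
(entry, form, etym, def, quote, date, pageref); none of these are taken
from the TEI or from any real dictionary. The first function sees only
the sequence of characters; the second uses the tags to restrict the
search to one kind of context.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical entry: the headword, content, and tag names are all
# invented for illustration.
ENTRY = """<entry>
  <form>exampleword</form>
  <etym>F. exemple, from an older form</etym>
  <def>A sample term, first recorded in <date>1791</date>.</def>
  <quote>The French usage is discussed at <pageref>p. 1791</pageref>.</quote>
</entry>"""


def plain_hits(xml_text, needle):
    """Count matches in the bare character sequence, with all tags stripped."""
    stripped = re.sub(r"<[^>]+>", "", xml_text)
    return len(re.findall(re.escape(needle), stripped))


def tagged_hits(xml_text, tag, needle):
    """Count matches only inside elements that carry the given tag."""
    root = ET.fromstring(xml_text)
    return sum("".join(el.itertext()).count(needle) for el in root.iter(tag))


# Without markup the date and the page reference cannot be told apart.
print(plain_hits(ENTRY, "1791"))
# With markup the search can be limited to dates, or to etymologies.
print(tagged_hits(ENTRY, "date", "1791"))
print(tagged_hits(ENTRY, "etym", "F."))
```

The untagged search returns every occurrence of the digits and leaves
the disambiguation to the reader; the tagged searches return only the
occurrences in the requested context.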

Simplicity and system-independence are good goals. But adequacy of our
representation of the text is an even more important goal. If our
representations of texts are not usable and lack all attempt to deal
with the realities of texts' complexities, then they are not much use to
anyone no matter how simple and system-independent they are.

The TEI is an attempt to provide precisely what is lacking: a
documented, device-independent, intellectually serious markup scheme for
texts used in literary, linguistic, and other textual research.

We should not be aiming at markup-free texts. We should be aiming at
device-independent texts with enough markup to do what we want to do
with them. (Room for lots of flexibility on that point.) The
'ASCII-only' slogan is misleading and chimerical.

-Michael Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago
(2) --------------------------------------------------------------42----
Date: Wed, 4 Apr 90 13:00:43 EDT
From: elli@harvunxw.BITNET (Elli Mylonas)
Subject: Scanning texts

I would just like to add some more numbers to the confusion. It seems
to me that keyboarding gives good accuracy at a respectable price,
with some basic tagging thrown in.
We get Greek done at over 99% accuracy -- that translates to an error or
two a page, at most -- for $2500 a Mb. Texts are typed in twice and
compared, which eliminates most typos. The data entry is done off-shore,
and can take less than a month for several megabytes, counting mailing
it to the American middleman and having it arrive back via him.
(That adds about 10 days.)
In any case, once a pipeline is started, it goes quite fast.
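
The "typed in twice, and compared" step can be sketched roughly as
follows. This is only an illustration of double keying using Python's
standard difflib module, not a description of the software the data
entry firm actually uses; the function name and the sample lines are
invented.

```python
import difflib


def disagreements(first_keying, second_keying):
    """Return (line_number, first_lines, second_lines) wherever the two
    independently typed versions differ. Line numbers are 1-based and
    refer to the first keying."""
    a = first_keying.splitlines()
    b = second_keying.splitlines()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    found = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            found.append((i1 + 1, a[i1:i2], b[j1:j2]))
    return found


# Invented sample: the second typist transposed two letters on line 2.
first = "In the beginning\nwas the word\nand the end"
second = "In the beginning\nwas the wrod\nand the end"

for line_no, left, right in disagreements(first, second):
    print(line_no, left, right)
```

Each reported line still has to be checked against the printed source
to decide which typist was right. An error made identically in both
keyings produces no disagreement and goes undetected, which is why the
comparison eliminates most typos rather than all of them.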

This handles accuracy and data entry. As for tagging, the keypunch
people will add minimal tagging, or extensive tagging if it has been
marked in the book. As a matter of fact, the texts that come back
have my tagging scheme pencilled in, throughout. Unfortunately, the
keyboarders or their mark-up person can only handle tagging sections
of text that are visibly differentiated from the rest on their own.
They are very good and consistent at following a pattern, and please
note, they are doing Greek to beta code!!

If, to the $2500 for the data entry, we add $500 (20 hours of a
high-level person to do markup before data entry, and some basic
verification after it), then we are still at $3K a Mb. That is a
far cry from $40K for 4 Mb. This process can certainly be used for
newer books that do not have complex structures, and so will not
require sophisticated tagging. But we were not really talking about
tagging to begin with!!
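
The arithmetic behind these figures, as a small sketch. The dollar
amounts are the ones quoted in this message; treating the $500 of
markup time as a per-megabyte cost is an assumption read off the "$3K
a Mb" total, and the function and key names are invented.

```python
def cost_summary(keying_per_mb=2500, markup_per_mb=500, markup_hours=20,
                 quoted_total=40_000, quoted_mb=4):
    """Compare keyboarding-plus-markup cost with the quoted $40K for 4 Mb."""
    keyboarded_per_mb = keying_per_mb + markup_per_mb
    quoted_per_mb = quoted_total / quoted_mb
    return {
        "keyboarded_per_mb": keyboarded_per_mb,              # dollars per Mb
        "markup_hourly_rate": markup_per_mb / markup_hours,  # dollars per hour
        "quoted_per_mb": quoted_per_mb,                      # dollars per Mb
        "ratio": quoted_per_mb / keyboarded_per_mb,          # quoted vs keyboarded
    }


summary = cost_summary()
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

On these numbers the markup time works out to $25 an hour, and the
$40K figure comes to $10K a Mb, a little over three times the
keyboarded cost.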

That's it for this fuel on the fire.
Elli Mylonas, Managing Editor, Perseus Project
Harvard University
(3) --------------------------------------------------------------31----
Date: Wed, 4 Apr 90 14:54:16 EDT
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1253 e-texts, stemmatology, scanning (107)

re: Bob Hollander's comments. We are in contact with both his project
at Princeton-Rutgers, and with Dartmouth directly. While we offer to
assist with proofreading, scanning, etc, we have received no requests
for assistance with the Dante project. Perhaps someone could clarify
if the Dante project does ONLY COMMENTARIES or the Dante text itself,
and whether the Dante text is available in English. Project Gutenberg
deals only with English etexts, and tries to put its efforts into doing
etexts of original works, rather than on works about other works, even
as much as we would like to do the Annotated Alice, someday.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

(4) --------------------------------------------------------------26----
Date: Wed, 4 Apr 90 14:59:07 EDT
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1253 e-texts, stemmatology, scanning (107)

re Bob Hollander's other point, concerning ACCURACY in etexts.

The mounds of time and money he claims are being spent checking for
ACCURACY, as he puts it, are being misspent. One of the greatest
advantages of etext is that the users can easily correct any errors.
An example: the University of Illinois library used to have an excellent
book on ROBITICS listed in the Library Computer Service, through which
about 75% of the materials are checked out. Any time a patron finds such
an error, they can report it, either by phone or by email. To hire people
to proofread the entire catalog every few years would be a waste, but the
system is self-improving in this manner, AND AT MINIMAL EXPENSE.

Michael S. Hart
(5) --------------------------------------------------------------46----
Date: Wed, 4 Apr 90 14:09:24 CDT
From: Michael S. Hart
Subject: Reply to Ken Steele

I would suggest Mark Zimmermann's FREE TEXT as an example of ANY USEFUL
text retrieval which requires NO MARKUP. It is also available free of
charge. Please send inquiries to HART@UIUCVMD, and I will forward them.
Michael S. Hart, Director, Project Gutenberg