3.1215 scanning and e-texts (204)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Mon, 26 Mar 90 20:01:03 EST

Humanist Discussion Group, Vol. 3, No. 1215. Monday, 26 Mar 1990.

(1) Date: Sun, 25 Mar 90 16:09:28 PST (15 lines)
From: hcf1dahl@UCSBUXA.BITNET
Subject: Scanning

(2) Date: Sun, 25 Mar 90 21:52:32 EST (70 lines)
From: amsler@flash.bellcore.com (Robert A Amsler)
Subject: More re: $10K/book encodings

(3) Date: Mon, 26 Mar 90 09:22:00 EST (18 lines)
From: DEL2@phoenix.cambridge.ac.uk
Subject: Re: [3.1208 super-scanning and its costs (137)]

(4) Date: Mon, 26 Mar 90 10:03:00 EST (37 lines)
From: (Robert Philip Weber) WEBER@HARVARDA
Subject: Re: 3.1208 super-scanning and its costs (137)

(5) Date: Mon, 26 Mar 90 12:57:09 EST (28 lines)
From: Richard Ristow <AP430001@BROWNVM>
Subject: Re: 3.1211 electronic texts (101)

(1) --------------------------------------------------------------------
Date: Sun, 25 Mar 90 16:09:28 PST
From: hcf1dahl@UCSBUXA.BITNET
Subject: Scanning

The Humanities Computing Facility of the University of
California at Santa Barbara has just acquired a Kurzweil
Model 5100 scanner, a fairly sophisticated example of the
species, and I'll be reporting on our experiences of its
capabilities and foibles in the coming weeks.

Eric Dahlin
Humanities Computing Facility
(2) --------------------------------------------------------------84----
Date: Sun, 25 Mar 90 21:52:32 EST
From: amsler@flash.bellcore.com (Robert A Amsler)
Subject: More re: $10K/book encodings

A few people have been corresponding with me privately regarding
my estimate of $10K per book to encode the Library of Congress in
usable machine-readable form. I feel I should explain a little
further why that figure is so high (maybe too high, but not by
too much considering what I expect from machine-readable encodings).

First, I am a bit surprised that so many humanists seem to conceive
of the works in the LofC as being only text. I've been trying to
imagine what ten randomly selected works in the Library of Congress
would contain. I see books with photographs (some consisting
almost entirely of photographs), color illustrations, engravings,
music, strange typographic conventions, manuscript pages,
scientific notations, patent drawings, tables of numbers,
maps, etc. What should be done with these? How can they be
encoded such that they are USABLE and SEARCHABLE by the computer?
For some we lack the skills. Photographs cannot be readily
searched to answer simple questions about their contents (find
me photos containing umbrellas in the 19th century!), but for
many of the other types of materials we can imagine what to do.

Second, I want uncompromisingly good encodings. I don't see
anyone wanting to pay for encodings that are inadequate by
some future scholar's standards of what should have been
captured. Sure, as individual researchers we can just encode
the features WE see as important, but for a national library
everyone's interests would have to be considered.
That means typographic conventions MUST be included: line
and page boundaries, word hyphens, paragraph indentations,
minor type variations, and type styles. Colors should be authenticated
as to exactly what color was there in the ink, and alignments need
to be precise, if someone is to live with the machine-readable
version alone.

Perhaps if I suggest a context. NASA decides that they want to
plan for the launching of a colony ship to Alpha Centauri and back
in the year 2020. The original passengers will die en route
and their children's children will reach Alpha Centauri.
To make this voyage feasible, the passengers will take along
all the Earth's knowledge. However, since storage space is
quite finite and access will need to be quite rapid, the knowledge
will have to be reduced to machine-readable form and access.
In fact, the access will involve both electronic imagery on
advanced workstations AND the ability to reconstruct a replica of
the original work on some material such as Tyvek (that's what those
unbreakable smooth plastic envelopes are made of).

Every concern should be considered to make this body of recorded
knowledge adequate for rapid access and for reproduction as nearly
perfect as possible. Assuming the technology for all this is possible,
with special printing systems and new display systems being developed
by the time of the voyage, what can be done to start the recording
of the information in the libraries now?

Now, the OED, I am told, spent nearly $4M on its 24-volume dictionary
to render it machine-readable. That's $166,667 per book. If that
is pretty much a worst case, then the average work will only cost 1/10
to 1/100th that much, say between $16,667 and $1,667 each.
And here I admit, my $200 billion might be reduced to $20 billion.
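
The per-volume arithmetic in this estimate can be checked with a minimal Python sketch. The figures are the ones given above; the function and variable names are illustrative assumptions only.

```python
# Figures taken from the message above; all names are illustrative only.
OED_COST_USD = 4_000_000   # reported cost of making the OED machine-readable
OED_VOLUMES = 24           # volume count as stated in the message


def per_volume_cost(total_cost, volumes):
    """Average cost of encoding one volume."""
    return total_cost / volumes


# The OED is treated as roughly the worst case.
worst_case = per_volume_cost(OED_COST_USD, OED_VOLUMES)

# An average work is assumed to cost 1/10 to 1/100 of the worst case.
high_estimate = worst_case / 10
low_estimate = worst_case / 100

print(f"worst case per volume: ${worst_case:,.0f}")
print(f"average work: ${low_estimate:,.0f} to ${high_estimate:,.0f}")
```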

Does this help? I am not trying for a quick fix. I am trying for
a machine-readable replacement that will meet all known needs
which the contents of the original work would have met, on the
assumption that the original work WILL NOT BE AVAILABLE AGAIN after
the encoding is made. What then would be needed?

(3) --------------------------------------------------------------26----
Date: Mon, 26 Mar 90 09:22:00 EST
From: DEL2@phoenix.cambridge.ac.uk
Subject: Re: [3.1208 super-scanning and its costs (137)]

Isn't the debate about Steve DeRose's suggestion (LoC on CD) beginning to
miss the point a bit if we concentrate on the problems of, eg, putting in
19th-century books or incunabula? In this age of electronic publication,
when most publishers do *something* via a computer for most books, I
gather that a large number of texts get put onto tape only to be
wiped off at a later stage. What a waste of effort! How about putting our
energies into finding some way of ensuring that all *future* texts get
stored in some e-repository; if only to await the day when retrieval
programs would permit their easy use. I wouldn't object to being given
whatever e-form of the second ed of the OED exists (the CD is the *first*
ed) with all its imperfections, idiosyncratic mark-up, or whatever it
was that prevented a CD issue. I certainly cannot afford to buy the
99% perfect e-first-edition.
Douglas de Lacey.
(4) --------------------------------------------------------------46----
Date: Mon, 26 Mar 90 10:03:00 EST
From: (Robert Philip Weber) WEBER@HARVARDA
Subject: Re: 3.1208 super-scanning and its costs (137)

amsler@flash.bellcore.com (Robert A Amsler) writes:

>If we accept an estimate of 20 Terabytes as the size of the Library
>of Congress, then at .5-1.0 megabyte per volume we'd get
>something like 20-40 million books. At $10,000 per book, that
>gives me something like $200 billion to represent the Library of
>Congress in a format as useful as the Oxford English Dictionary.

Although I am willing to accept Amsler's estimates of the size of the
library, I find his cost figures high by an order of magnitude, at
least. By the end of the decade we will have reached the point
where the cost of digitizing is less than or equal to the cost of
microfilming. Today, that cost is about $100 per average book.
Even if the cost of all this doubled, to say $200 per average book,
that's still a reasonable total sum. And even if that doubled because
some materials were difficult to put into digital format, we are
now up to $400 per book. Thus it seems to me that the MOST this
project could be would be $16 billion. Spread over 16 years, this
is a billion a year, which is not unthinkable.
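
The two competing totals can be reproduced with a minimal Python sketch. It uses only the figures quoted in this exchange (20 terabytes, 0.5 to 1.0 megabyte per volume, and the per-book costs); the function and variable names are illustrative assumptions.

```python
# Inputs from the quoted estimate and the reply; all names are illustrative.
LIBRARY_BYTES = 20e12               # 20 terabytes
BYTES_PER_VOLUME = (0.5e6, 1.0e6)   # 0.5 to 1.0 megabyte per volume


def volume_count(total_bytes, bytes_per_volume):
    """Number of volumes implied by a total size and a per-volume size."""
    return total_bytes / bytes_per_volume


def project_cost(volumes, cost_per_volume):
    """Total cost in dollars of encoding every volume."""
    return volumes * cost_per_volume


max_volumes = volume_count(LIBRARY_BYTES, BYTES_PER_VOLUME[0])  # 40 million
min_volumes = volume_count(LIBRARY_BYTES, BYTES_PER_VOLUME[1])  # 20 million

# Compare the per-book costs discussed: $100, $200, $400 and $10,000.
for cost in (100, 200, 400, 10_000):
    low = project_cost(min_volumes, cost)
    high = project_cost(max_volumes, cost)
    print(f"${cost:>6,}/book: ${low / 1e9:,.0f} to ${high / 1e9:,.0f} billion")

# The $16 billion ceiling is 40 million books at $400 each.
ceiling = project_cost(max_volumes, 400)
per_year = ceiling / 16             # spread over 16 years
```

The $16 billion figure corresponds to the upper volume count (40 million) at $400 per book; at the lower count of 20 million the same per-book cost gives half that.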

In my article in Publishers Weekly (January 12, 1990, pp. 38-39),
I envision just such a possibility (among others). It may well
make economic sense and good information policy sense.
Bob Weber
Robert Philip Weber, Ph.D. | Phone: (617) 495-3744
Senior Consultant | Fax: (617) 495-0750
Academic and Planning Services | Bitnet: weber@harvarda
Division | Internet: weber@sunrise.harvard.edu
Office For Information Technology| weber@popvax.harvard.edu
Harvard University | weber@world.std.com
50 Church Street |
Cambridge MA 02138 |
(5) --------------------------------------------------------------36----
Date: Mon, 26 Mar 90 12:57:09 EST
From: Richard Ristow <AP430001@BROWNVM>
Subject: Re: 3.1211 electronic texts (101)

In Humanist 3.1211 (on electronic texts) Michael S. Hart writes

> ... , it requires minimal effort, time, or expense to log in to that
>computer which contains the master text to correct an error, which those
>copies made subsequently would each contain the upgrade without muss and
>fuss, with the version number perhaps upgraded by .001 to identify these
>variant editions for the purists.

The "minimal effort ... " is right, and is a common and dangerous way
of getting into trouble with machine-readable information. It's very
easy for two people, or even one person, to have two different working
copies of a text or file, and even if they are 'purists' and check the
version numbers to know they ARE different, to have no way of knowing
what the differences are, or the reasons for believing the later version
is more accurate than the earlier. It can be maddeningly difficult to
discover whether a perceived difference is an inaccuracy in memory, a
difference in interpretation, or a change in the text. It can be nearly
impossible to either prevent or detect the well-meaning, 'obvious', but
incorrect 'correction'. A serious on-line revision system should log all
changes, with date-time made or earliest version where applied or both,
and the authority for the change (even the name of the individual making
the physical edit is invaluable for tracking problems and stimulating care).
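
The kind of change log described above might look like the following minimal Python sketch. The class and field names (`Change`, `LoggedText`, `authority` and so on) are assumptions made for illustration and do not describe any existing system; the sample text, editor, and authority are likewise made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Change:
    """One logged edit to the master text (all field names are hypothetical)."""
    position: int        # character offset where the edit applies
    old_text: str
    new_text: str
    editor: str          # individual who made the physical edit
    authority: str       # grounds for believing the new reading is correct
    made_at: datetime    # date-time the change was made
    version: int         # earliest version in which the change appears


@dataclass
class LoggedText:
    text: str
    version: int = 0
    log: List[Change] = field(default_factory=list)

    def replace(self, old, new, editor, authority, made_at=None):
        """Replace the first occurrence of `old` and record who, when, and why."""
        position = self.text.find(old)
        if position < 0:
            raise ValueError(f"{old!r} not found in text")
        self.text = self.text[:position] + new + self.text[position + len(old):]
        self.version += 1
        change = Change(position, old, new, editor, authority,
                        made_at or datetime.now(timezone.utc), self.version)
        self.log.append(change)
        return change

    def changes_since(self, version):
        """Edits separating an older working copy from the current text."""
        return [c for c in self.log if c.version > version]


# Usage with made-up data: correct one typo, then list what changed.
doc = LoggedText("The quick brown fox jumps over teh lazy dog")
doc.replace("teh", "the", editor="hypothetical editor",
            authority="obvious transposition, checked against print copy")
for c in doc.changes_since(0):
    print(c.version, c.editor, repr(c.old_text), "->", repr(c.new_text),
          "|", c.authority)
```

With such a log, two people holding different working copies can ask for the changes since their own version number and see both what differs and on what grounds, instead of knowing only that the copies differ.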