3.1247 electronic texts, paradigms, scanning (175)

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Mon, 2 Apr 90 19:50:54 EDT

Humanist Discussion Group, Vol. 3, No. 1247. Monday, 2 Apr 1990.

(1) Date: Mon, 2 Apr 1990 09:14:58 EST (36 lines)
From: Jan Eveleth <EVELETH@YALEVM>
Subject: Information Distribution

(2) Date: Sat, 31 Mar 90 13:19:48 EST (43 lines)
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1235 scanning and e-texts, cont. (251)

(3) DATE: 02 APR 90 11:02 CET (26 lines)
FROM: A400101@DM0LRZ01
SUBJECT: etexts and data format/preservation (3.1231ff.)

(4) Date: 02 Apr 90 11:28:19 EST (40 lines)
From: James O'Donnell <JODONNEL@PENNSAS>
Subject: Scanning the libraries

(1) --------------------------------------------------------------------
Date: Mon, 2 Apr 1990 09:14:58 EST
From: Jan Eveleth <EVELETH@YALEVM>
Subject: Information Distribution

Duplication and distribution of the information repositories in an
electronic-mediated society are critical; I strongly agree with D. Harbin.

His note brought to mind a recent news story reporting a near "accident"
at a nuclear power plant (was it in the south somewhere?). It seems the
nuclear power plant runs on electricity from another, non-nuclear, local
power utility. Something caused the local power utility to lose power
production leaving the nuclear power plant in a desperate and dangerous
predicament as it lost control of its safety systems and control panels.
What is the reality of implementing new and more powerful technologies
on limited budgets? The new technologies are bootstrapped on top of
older technologies and the proverbial weak link remains the measure of

Information theory, as interpreted from my "lay" perspective, says that
an evolving structure will be constrained in its future configurations
by the historical events that formed the structure. In biology, for
example, the DNA of a species not only provides the foundation for what a
species can become, but it also defines what the species can *not*
become, e.g. a bird species will not evolve (or in this case "devolve")
into a reptile species.

Perhaps the best strategy for implementing a new electronic information
repository is to start from scratch, to think beyond what has been and
consider how it could be. (The Text Encoding Initiative, from what I've
read, seems to be taking this strategy.) Microfilming and scanning will
preserve what is currently available, but surely the dream of an age of
electronic information can reach beyond the bootstraps and constraints
of the current paradigm?

--Jan Eveleth
Yale University

(2) --------------------------------------------------------------51----
Date: Sat, 31 Mar 90 13:19:48 EST
From: "Michael S. Hart" <HART@UIUCVMD>
Subject: Re: 3.1235 scanning and e-texts, cont. (251)

I would like the participants in this discussion of scanning and etexts
to be a little more specific in the work for which they estimate costs
as it would appear to be totally out of line with real-world typing, a
market which is probably easily analyzable in each of your communities
as an academic side market. In all the college towns I am aware of,
even including those in which apartments are well over $1,000/month, the
price of typing theses is under $1 per page. If we assume the normal
thesis (is there such a thing?) to have perhaps 1/3 the material of a
book page, double spaced, wide margins, etc., then we could raise that price
to $3 per page. If we assume the average book we want to put in etext
form at 333 1/3 pages, then we should be able to hire out the work for
$1,000 per volume. Thesis preparers have to know all the rules for the
colleges for which they prepare, and I think we can assume these rules
are no more complicated than those for typing in a book.
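
The arithmetic in the paragraph above can be checked with a minimal
sketch. The figures are the ones quoted in the message ($1 per thesis
page as an upper bound, a factor of 3 for page density, 333 1/3 pages
per book); the function and variable names are hypothetical and chosen
only for illustration.

```python
from fractions import Fraction


def cost_per_volume(thesis_rate, density_factor, pages):
    """Estimated typing cost for one book, in dollars.

    thesis_rate    -- price per thesis page
    density_factor -- how many thesis pages of text fit on one book page
    pages          -- number of pages in the book
    """
    return thesis_rate * density_factor * pages


# Figures taken from the message; exact fractions avoid rounding 333 1/3.
thesis_rate = Fraction(1)      # upper bound: $1 per thesis page
density_factor = 3             # a book page holds about 3x a thesis page
pages = Fraction(1000, 3)      # 333 1/3 pages

book_page_rate = thesis_rate * density_factor
volume_cost = cost_per_volume(thesis_rate, density_factor, pages)

print(f"${book_page_rate} per book page, ${volume_cost} per volume")
```

Changing any one of the three inputs shows how sensitive the
per-volume estimate is to the assumed page density and book length.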

The cost of tagging, encoding, SGMLing, TEIing, PostScripting, TEXing,
or whatever-the-type-of-massaging-you-have-in-mind is totally up to an
individual and cannot be calculated here.

ONLY TO ESOTERIC USERS. Of course, there are those, in the majority,
who constantly harangue etext altogether because it is only for these
minorities represented well in this discussion.

By the way, at least one of our Gutenberg members is willing to create
etexts at only 50 cents per page, a fact I mentioned earlier, but the
response was zero.

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

(3) --------------------------------------------------------------31----
DATE: 02 APR 90 11:02 CET
FROM: A400101@DM0LRZ01
SUBJECT: etexts and data format/preservation (3.1231ff.)

Sorry to challenge Michael Hart out of his self-imposed retirement, but
obsolete data formats _are_ a problem. Naturally, the texts in general
use (representing probably much less than 1% of all surviving texts) are
going to be constantly recopied, and data formats are unproblematical.
But the majority (we started out, remember, from the idea of scanning
the LOC) will sit around in libraries waiting to be looked at once every
twenty? fifty? hundred? years, as most books do at present. They thus
have to survive for long periods of time and then be readable
afterwards. To visualize what this means I suggest the following thought
experiments:
1. going into every hi-fi shop in your home town and asking to buy
equipment which will play back shellac records at 78 rpm, over the
counter and at a reasonable price.
2. storing a large text on an MS-DOS disk now, shutting it away in suitable
protected conditions and then walking into your university computer
centre in 20/30 years' time and asking them to upload it to the then
If the results of these thought experiments make you happy, fine.
Otherwise, how do we ensure immortality for our hard-scanned efforts?

Timothy Reuter, Monumenta Germaniae Historica

(4) --------------------------------------------------------------45----
Date: 02 Apr 90 11:28:19 EST
From: James O'Donnell <JODONNEL@PENNSAS>
Subject: Scanning the libraries

From: Jim O'Donnell (Penn, Classics)

Scan all LC? Wonderful! But there's one way to make the job a lot cheaper, and
another that will make it more expensive.

Cheaper: Lots of stuff in LC not urgently necessary. Hundreds of volumes of
census data from India, for example; or dozens of different editions of
*Moby-Dick*. Get it all eventually, sure; but the utility of the *first*
scanned copy of *Moby-Dick* is about 100 times greater (1000 times?) than that
of the second through hundredth versions put together. I look at the Bryn Mawr
College library, a collection of about half a million volumes (5% the size of
LC?), and see that for the bookish disciplines (basically: humanities), that
collection serves 90-95% of need. Complete works of all major and very many
minor authors, standard journals in all fields, and excellent, if selective,
coverage of current scholarly publication. Perhaps you would need to enhance
the coverage to about 750K or 1 million to include some fields in which that
kind of collection is weak, or you would decide you absolutely had to have the
Sitzungsberichte of the major German academies. But even so, you could get to
the point where a *very* satisfactory working library was available on-line
while reducing costs over those of doing all LC by at least an order of
magnitude. (Heuristic way to proceed? How about using circulation records?
Start with a really big library, to ensure that almost everything is there,
and scan everything that's been checked out twice in the last ten years? Then
go back gradually and pick up the rest over the next few decades?)
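
The circulation-record heuristic in the paragraph above might be
sketched roughly as follows. This is only an illustration under stated
assumptions: the record layout (a mapping from title to a list of loan
dates), the function names, and every date in the sample data are
invented here; only the rule itself (checked out twice in the last ten
years) comes from the message.

```python
from datetime import date


def years_before(day, years):
    """Return the date `years` years before `day`."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use the 28th.
        return day.replace(year=day.year - years, day=28)


def select_for_scanning(checkouts, today, min_loans=2, window_years=10):
    """Return the sorted titles loaned at least `min_loans` times
    between the cutoff date and `today`, inclusive."""
    cutoff = years_before(today, window_years)
    selected = []
    for title, loan_dates in checkouts.items():
        recent = [d for d in loan_dates if cutoff <= d <= today]
        if len(recent) >= min_loans:
            selected.append(title)
    return sorted(selected)


# Invented sample records, for illustration only.
circulation = {
    "Moby-Dick": [date(1984, 3, 1), date(1989, 11, 15)],
    "Census of India, vol. 12": [date(1971, 6, 2)],
    "Sitzungsberichte (sample volume)": [date(1978, 1, 9), date(1988, 5, 20)],
}

first_pass = select_for_scanning(circulation, today=date(1990, 4, 2))
print(first_pass)
```

Lowering min_loans or widening window_years in later passes would
correspond to going back gradually to pick up the rest.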

More expensive: But when we're being thorough, we must remember to coordinate
with non-American sources. That can save some money (a nice argument would be
that the industrialized democracies should all scan whatever was published
within their borders), but also cost some: good coverage of 17th, 18th, and
19th century European publication will not add *vast* numbers to the total
from LC, but it will be hard work to locate and make sure that things are
included: can't just start at one end of the shelves and scan to the other.
And whatever mechanization is possible with 20th cent. publications won't work
with fragile older volumes. The real danger in scanning our way into the next
technology is that a lot of our memory will fade, neglected because too old,
too dusty, too fragile, too uneconomical to bother with.