Humanist Discussion Group, Vol. 15, No. 66.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>
Date: Thu, 31 May 2001 06:32:23 +0100
From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
Subject: Re: 15.061 OCR on hand-printed texts
Your question as to OCRing 17th century printed texts is right on,
given the discussion on the difficulty of humanities computing.
The answer is: it can be done, and I have done it, but it may not
be worth the candle. NB that we are talking of printed, not
manuscript texts. First problem: The condition of the book itself,
i.e. how much foxing, splotching and the like it shows, and how many
abbreviations and ligatures it uses. 2d problem: How to get the text into
the computer (I leave aside the problem of platforms, etc.).
Nowadays, depending on the shape of the text, I would suggest using a
digital camera, which makes it possible to avoid all that skewing you get
when you scan on a flatbed. 3d problem: You will need to have an OCR program
which can be trained. I would suggest OmniPage Pro from Caere. Do not
overtrain.
Note that you need as good a copy as possible, so you may have to use a
graphics program to remove gray levels, spots (use despeckling), etc.
This sounds tedious, but becomes fairly routine once you get into it.
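
For those who would rather script the cleanup than click through a graphics program, here is a minimal sketch of the two steps just mentioned: removing gray levels (thresholding to pure black and white) and despeckling (dropping isolated ink pixels). It is an illustration only, not what any particular graphics or OCR package does internally. The function names, the threshold of 128, and the 8-neighbor rule are all assumptions chosen for the example, and the page is represented as a plain list of rows of 0-255 gray values.

```python
# Illustrative sketch only: names, threshold and neighbor rule are assumptions.

def binarize(gray, threshold=128):
    """Remove gray levels: map each 0-255 value to 1 (ink) or 0 (paper)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def despeckle(bitmap, min_neighbors=1):
    """Clear ink pixels with fewer than min_neighbors ink pixels around them."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]  # work on a copy, read the original
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += bitmap[ny][nx]
            if neighbors < min_neighbors:
                cleaned[y][x] = 0  # isolated spot: treat it as dirt
    return cleaned


# A tiny made-up "page": a three-pixel stroke, one light gray smudge (180),
# and one isolated dark speck at the right edge.
page = [
    [250, 240, 245, 250, 250],
    [245,  30,  20, 250, 250],
    [250,  25, 180, 250,  10],
    [250, 250, 250, 250, 250],
]
clean = despeckle(binarize(page))
```

On this toy page the smudge disappears at the thresholding step and the lone speck at the despeckling step, while the stroke survives. On real pages the threshold and the neighbor count would need tuning per book, since too aggressive a setting will also eat punctuation and the dots on i's.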
I have scanned the old Du Cange, a Lapide, and the 18th C. Oxford
Cicero with good success.
BTW, there used to be a letter in the Humanist archives on an
attempt to use the Kurzweil 4000; it is a perfect example of a
semi-luddite trying to use new technology. Anecdote 2: I remember a
colleague when I was in Germany in 1988 pointing out to me the
uselessness of these computers, since he had tried to scan 18th C.
texts. I was able to make a training set for the Kurzweil 2000
(splendid machine; I hated to see them go) which performed his task
satisfactorily.
We luddites (Maschinenstuermer) have our problems with these
machines. How do you convert a .pcx to a .gif? Why would you want to?