15.066 OCR on hand-printed texts

From: by way of Willard McCarty (willard@lists.village.Virginia.EDU)
Date: Tue Jun 05 2001 - 01:45:54 EDT


                    Humanist Discussion Group, Vol. 15, No. 66.
           Centre for Computing in the Humanities, King's College London

             Date: Thu, 31 May 2001 06:32:23 +0100
             From: "Jim Marchand" <marchand@ux1.cso.uiuc.edu>
             Subject: Re: 15.061 OCR on hand-printed texts

    Your question as to OCRing 17th century printed texts is right on,
    given the discussion on the difficulty of humanities computing.
    The answer is: it can be done, and I have done it, but it may not
    be worth the candle. NB that we are talking of printed, not
    manuscript texts. First problem: the condition of the book itself,
    that is, how much foxing, splotching and other damage there is, and
    how many abbreviations and ligatures. Second problem: how to get the
    text into the computer (I leave aside the problem of platforms, etc.).
    Nowadays, depending on the shape of the text, I would suggest using a
    digital camera, which makes it possible to avoid all that skewing you get
    when you scan on a flatbed. Third problem: you will need an OCR program
    which can be trained. I would suggest OmniPage Pro from Caere. Do not

    Note that you need as good a copy as possible, so you may have to use a
    graphics program to remove gray levels, spots (use despeckling),
    etc. This sounds tedious, but can be fairly routine
    once you get into it.
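
    The cleanup step described above can be sketched roughly as follows.
    This is only an illustration, not what any particular graphics program
    does: it assumes the page is already available as a grid of grayscale
    values (0 = black, 255 = white), and the function names, the threshold
    of 128 and the neighbour rule are all assumptions made for the example.

```python
def binarize(gray, threshold=128):
    """Remove gray levels: map each pixel to 0 (ink) or 255 (paper).

    `threshold` is an assumed cut-off; real pages usually need tuning.
    """
    return [[0 if p < threshold else 255 for p in row] for row in gray]


def despeckle(bw, min_neighbors=1):
    """Remove isolated ink pixels from a black-and-white image.

    An ink pixel is kept only if at least `min_neighbors` of its
    8 surrounding pixels are also ink; otherwise it is turned to paper.
    """
    height = len(bw)
    width = len(bw[0]) if height else 0
    out = [row[:] for row in bw]  # work on a copy, read from the original
    for y in range(height):
        for x in range(width):
            if bw[y][x] != 0:
                continue  # not ink, nothing to remove
            ink = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and bw[ny][nx] == 0:
                        ink += 1
            if ink < min_neighbors:
                out[y][x] = 255  # isolated speck: erase it
    return out


# Hypothetical 5x5 page: light gray paper, one dark speck,
# one faint smudge, and a short horizontal stroke of ink.
page = [[200] * 5 for _ in range(5)]
page[0][4] = 10                          # isolated speck
page[4][0] = 150                         # gray smudge, lighter than the threshold
page[2][1] = page[2][2] = page[2][3] = 30  # a real stroke

clean = despeckle(binarize(page))
```

    In this toy page the smudge disappears at the thresholding step and the
    lone speck at the despeckling step, while the three-pixel stroke
    survives. On real scans a rule this crude would also eat dots and
    punctuation, so the parameters would have to be chosen with care.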

    I have scanned the old Du Cange, a Lapide, and the 18th C. Oxford
    Cicero with good success.

    BTW, there used to be a letter in the Humanist archives on an
    attempt to use the Kurzweil 4000; it is a perfect example of a
    semi-luddite trying to use new technology. Anecdote 2: I remember a
    colleague when I was in Germany in 1988 pointing out to me the
    uselessness of these computers, since he had tried to scan 18th C.
    texts. I was able to make a training set for the Kurzweil 2000
    (splendid machine; I hated to see them go) which performed his task

    We luddites (Maschinenstuermer) have our problems with these
    machines. How do you convert a .pcx to a .gif? Why would you want to?
