9.655 OCR software

Humanist (mccarty@phoenix.Princeton.EDU)
Tue, 26 Mar 1996 18:50:31 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 655.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)
Information at http://www.princeton.edu/~mccarty/humanist/

[1] From: Jim Marchand <marchand@ux1.cso.uiuc.edu> (17)
Subject: OCR

[2] From: John Price-Wilkin <jpwilkin@umich.edu> (43)
Subject: Re: 9.651 the state of OCR?

[3] From: Jean Anderson <jganders@human.gla.ac.uk> (12)
Subject: Re: 9.651 the state of OCR?

[4] From: Roger Brisson <rob@psulias.psu.edu> (40)
Subject: Re: OCR

Date: Mon, 25 Mar 96 19:22:31 CST
From: Jim Marchand <marchand@ux1.cso.uiuc.edu>
Subject: OCR

I have tried most of them and agree with you that OmniPage Pro is the best.
I do lots of scanning, formerly because of neuropathy (could not type), now
because I like it. In its latest incarnation, OmniPage Pro has the fault
that one is required during the installation program to register the program
in order to avoid being terminated by their counter (I believe one gets 15
scans before having to register). I do not like to be told to do anything
with a piece of software I have paid good money for. If you do a lot of
text scanning, it is well to train it to read the particular font used in a
series. I trained it to read SUGNL fonts, that being the largest collection
of Old Norse texts; now I can scan in any one I want with no trouble. I
have not tried it with non-European scripts, but I had great success there
with the old Kurzweil 4000. The Kurzweil Discover had many good features,
among others that one could scan while working on other things, since it had
its own CPU and did not tie up the machine it was connected to. Alas, it
was not trainable. It is obvious that we still have lots of work to do in
the OCR field. BTW, do not hesitate to use all the languages you need with
OmniPage Pro; I use about 7 and have not noticed any appreciable slow-down.
Jim Marchand.

Date: Mon, 25 Mar 1996 20:12:20 -0500 (EST)
From: John Price-Wilkin <jpwilkin@umich.edu>
Subject: Re: 9.651 the state of OCR?

I think one thing that needs to be emphasized is that OCR programs are
suited to particular tasks, and a package that might be good for one type
of text might not be good for another. Our main OCR person uses
TypeReader for most English texts because it has proven most accurate on
older typefaces. She has tended to use WordScan after this, and rarely
OmniPage. When we need to go for trainability, we go to Xerox's ScanWorx.
ScanWorx is a Unix package that, despite assurances from the developers
that the engine is the same as that used with TextBridge (which we beta
tested), is infinitely more intelligent than TextBridge Pro. Still, the
interface in ScanWorx is really crude and makes whipping through hundreds
of pages of text a real chore, while the TypeReader package makes this a


Date: Tue, 26 Mar 1996 10:24:07 +0001
From: Jean Anderson <jganders@human.gla.ac.uk>
Subject: Re: 9.651 the state of OCR?

Does anyone know what happened to OPTOPUS? A couple of years
ago it was said to be by far the best OCR package - but too
expensive for most of us to try.

Jean Anderson
Resource Development Officer
Arts Technical Resource Unit
University of Glasgow, 6 University Gardens
Glasgow G12 8QH
0141 330 4980

Date: Tue, 26 Mar 1996 10:16:56 -0500
From: Roger Brisson <rob@psulias.psu.edu>
Subject: Re: OCR

The recent reviews I've seen on OCR software confirm your assessment
that OmniPage Pro has a clear lead on the other products. I have both
TextBridge and OmniPage Pro, and I find OmniPage much superior. I use
OmniPage (for the Mac) at home on my NeXT workstation for German and English
texts, and I have been pleasantly surprised again and again at what it can
usefully recognize. We are getting the latest version of WordScan set up in
our department, and since it's from the Caere Corp. (which also makes
OmniPage) we are expecting it to perform just as well, or almost as well (I
understand it uses the same recognition engine as OmniPage). I recently
used OmniPage to scan and recognize Hegel's Phänomenologie des Geistes in its
entirety; it took one weekend to accomplish and the clean-up was negligible
(I'm creating a critical electronic edition of the 1831 edition for the Web
so the post-scanning editorial work is of course sizable).
Because OmniPage Pro is both highly automated and easy to use, it
has become a personal workhorse for scanning in everything from letters and
short documents to, well, pretty much anything that is of reasonable print
quality. This has been extremely useful for getting paper-based materials
on the Web. Like you I've been following OCR technology closely for
several years now, and for me OmniPage Pro 5.0 represents the first
generation of OCR software that really does what it is supposed to achieve:
to make automated input of printed text (significantly) faster and easier
than doing it manually. It does have its limits; in particular, for
photocopies of poor quality the accuracy rate nosedives quickly. In this
respect it still has a ways to go, if one considers the ideal to be
recognizing with 100% accuracy anything that we can still read comfortably.
But as I've said, I've been more surprised at what it can accomplish than
what it can't. I recently tried the new version (6.0), which according to
Caere has a better recognition engine, but a couple of personal tests I
performed did not bear this out.
I don't have the cite with me, but Mark Olsen of ARTFL has written
an interesting article on his experiences with OCR for large text inputting
projects (for ARTFL). I believe it appeared in Computers and the Humanities
a couple of years ago. It's a bit dated now, but the problems he
discusses when using OCR for large projects are interesting and in many
respects still valid. I read it some time ago, so I don't know if a product
like OmniPage Pro 6.0 would change his assessment of the usefulness of OCR
in any way.

Roger Brisson
Penn State University