optical scanners (184)

Mon, 3 Apr 89 20:48:33 EDT

Humanist Mailing List, Vol. 2, No. 794. Monday, 3 Apr 1989.

Date: Mon, 3 Apr 89 17:14 EDT
Subject: OCR: sorting it out

Jamie Hubbard (Smith College)

I too have been worrying about "what to do about OCR."
There are many new products being introduced in the OCR market.
I need to evaluate several options and make a recommendation in the
next month or so for the purchase of several (as many as five) units
to begin a major data input project. The main considerations (as I see
them) are:

1) Speed: target = 250,000 pages.

2) Trainable recognition. The material uses roman characters
except that some of them have unusual diacritics (romanized
Pali), so the software must be trainable. Accuracy is of course
a sub-set of this consideration.

3) Column handling. The pages are relatively simple, no columns,
but they do have superscripted footnote numbers, with the notes
gathered at the bottom of each page. It would be nice to
automatically handle this, though a simple separation of
text/notes into two files that could later be hand/semi-
automatically put into the text (following the recent suggestions
on Humanist) is probably the limit at this stage (maybe even
beyond the limit).
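
A minimal sketch of the "two files" idea in point 3, in Python. It is
not a feature of any product discussed here. It assumes the OCR output
for a page is plain text, that blank lines separate blocks, and that
every footnote block begins with its note number; the names NOTE_START
and split_page are hypothetical.

```python
import re

# Assumption: a note block opens with a 1-3 digit number, optionally
# followed by "." or ")", then whitespace, e.g. "12 Cf. ..." or "12. Cf. ...".
NOTE_START = re.compile(r"^\s*\d{1,3}[.)]?\s+\S")

def split_page(page_text):
    """Return (body, notes) for one page of recognized text."""
    # Blocks are runs of lines separated by one or more blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", page_text.strip()) if b.strip()]
    # Walk up from the bottom of the page while blocks look like notes.
    first_note = len(blocks)
    while first_note > 0 and NOTE_START.match(blocks[first_note - 1]):
        first_note -= 1
    body = "\n\n".join(blocks[:first_note])
    notes = "\n\n".join(blocks[first_note:])
    return body, notes

page = (
    "The Buddha taught the dhamma1 at Savatthi.\n"
    "He stayed there for the rains.\n"
    "\n"
    "1 Pali: dhamma.\n"
    "\n"
    "2 See vol. 2."
)
body, notes = split_page(page)
print(body)
print("-----")
print(notes)
```

The body and notes strings could then be appended to two separate
files, page by page, for later hand or semi-automatic merging.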

The following represents a summary of the comments and reviews I
have read to date. I have for the most part just paraphrased the
originals; my apologies in advance if I have misrepresented anything.

INFOWORLD, 9/20/88: CTA of Barcelona released TEXTPERT for the
Mac, claiming 99.5% accuracy. It uses matrix recognition,
topological, and feature-based technologies. "The program is
capable of reading any Macintosh font and typeface and any Indo-
European language." It reads 1500 characters per minute and takes
1 to 2 minutes for a full page typewritten document. Cost is
$995.00. It is marketed in Europe under the name Textscan. CTA
Inc., 747 3rd Avenue, 3rd floor, New York, NY 10017.
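
A back-of-the-envelope Python sketch of what the quoted 1 to 2 minutes
per page would mean for the 250,000 page target. It assumes recognition
time is the only cost (no page handling, training, or verification) and
that the work divides evenly over five units; the function name
machine_hours is hypothetical.

```python
TARGET_PAGES = 250_000   # project target stated above
UNITS = 5                # "as many as five" units

def machine_hours(pages, minutes_per_page, units=1):
    """Hours of recognition time per unit, work split evenly."""
    return pages * minutes_per_page / 60 / units

for minutes in (1, 2):
    total = machine_hours(TARGET_PAGES, minutes)
    per_unit = machine_hours(TARGET_PAGES, minutes, UNITS)
    print(f"{minutes} min/page: {total:,.0f} machine-hours in all, "
          f"{per_unit:,.0f} hours on each of {UNITS} units")
```

On those assumptions the job comes to roughly 4,200 to 8,300
machine-hours in all, or about 830 to 1,670 hours per unit with five
units.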

A phone rep said (4/3/89) that TextPert was trainable; new
version, 3.0, enhances training operation; includes four
different character sets: roman, Hebrew, Greek, Cyrillic; has a
"dictionary" of 32,000 characters (640 fonts) plus user-defined,
trainable character sets (in other words, once you have trained
special characters they can be "added" to the pre-trained sets,
similar to the way a spell-checker searches user-defined
dictionaries as well as main dictionaries); you can also eliminate
characters that you don't need from the "template" set, such as
trademark symbols, alpha, etc. (for example, if your material had
no numbers you could eliminate the numbers in all 640 fonts,
i.e., 6,400 fewer characters to check against and no "one vs. el"
problems); supports output in ASCII, MacWrite, and Microsoft
Word, including font info; supports most major scanners (HP has
sheet-feeder and TextPert can use it). Product has been on the
market in Europe for a number of years, proven in the field, has
support and documentation in some nine different languages. Will
run on a Mac Plus with one meg of memory; will run on saved files (in
TIFF??) so that one machine could scan and save images, another
do the recognition, and another format and verify, etc.
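
The arithmetic behind the template-set figures can be sketched in
Python. This is only an illustration, not TextPert's actual data
structure: it assumes the 32,000 templates are spread evenly over the
640 fonts and that each font carries the ten digits. The prune function
and the toy character list are hypothetical.

```python
import string

FONTS = 640
TEMPLATES = 32_000
DIGITS = set(string.digits)   # assumption: ten digit templates per font

def prune(templates, unwanted):
    """Drop every (font, character) template whose character is unwanted."""
    return [(font, ch) for font, ch in templates if ch not in unwanted]

# Toy template set: 50 characters per font (10 digits + 40 letters),
# which matches 32,000 / 640 under the even-spread assumption.
chars = string.digits + string.ascii_letters[:40]
templates = [(font, ch) for font in range(FONTS) for ch in chars]

pruned = prune(templates, DIGITS)
print(len(templates), "templates before pruning")
print(len(templates) - len(pruned), "digit templates removed")
print(len(pruned), "templates left to match against")
```

With the digits gone there is no "1" template left to confuse with a
lower-case "l", which is the "one vs. el" point above.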

INFOWORLD, 9/20/88: Calera Recognition Systems Inc. (formerly The
Palantir Corp.) said that its TRUESCAN boards give "complete
document recognition" capabilities, including automatic
decolumnizing and the recognition and preservation of tables,
lists, and other formats within documents. The board plugs into
a variety of scanners (most major brands are listed) and
outputs to a number of major word processor formats (MS
Word, Word Perfect, Xywrite, etc.) and graphics formats. Two
models are listed: Model S, $2495, and Model E, $3495, which is
faster and supports rotating pages.

INFOWORLD, 2/27/89: TrueScan is very easy to use, but lacks any
options to scan only parts of a page; scans both image and text
at the same time, storing results in separate files; seems to
take longer than Omnipage (app. 90-150 seconds per page-- note:
the reviewer used Model E, the faster of the two); very accurate:
3 errors in a 700 word PC WORLD article (99.6%), 32 errors in a
2500 word Forbes article (98.7%). Both Truescan and Omnipage work
with TIFF (meaning you could scan on a number of different
scanners, save the files, and then use the Truescan just for
OCR??); does a very good job of preserving formatting codes
intact in whatever output structure you choose (38 different text
applications, including common word processors and spreadsheets),
such as putting proper characters for underlining and boldface
into the output files.

Handled the PC Week page that choked the KDM 5000 (see below) "without
a mistake."

PC MAGAZINE, 3/28/89: Accuracy rates higher than even the
Kurzweil 5000; does some context and spell checking for difficult
characters; doesn't do so well on xeroxes and small point sizes.

3) KURZWEIL 5000
Interesting features from sales brochure: book edge scanning, 50
page auto document feeder, 50,000 word lexicon and user-defined
lexicons up to 10,000 words (would seem to allow for semi-auto
"verification" at time of entry), background operation.

HUMANIST, 11/16/88, FROM ERDT@VUVAXCOM (Terrence Erdt): "a part
of a page taken from PC Week, consisting of relatively large type
(English text, about 10 points, I would imagine; proportionally
spaced, typeset, of course). The 5000 was as slow and bungling as
the Kurzweil [Discover] 7320, which I had tested more thoroughly
sometime ago: after about five minutes of processing it produced
nothing more than gibberish." WOW!

PC MAGAZINE, 3/28/89: In an extensive report on "scanners" and
their OCR capabilities, PC MAGAZINE picked the Kurzweil 5000 as
the best of the lot (100% accuracy with Courier, 99.4% with Times
Roman, and 99.8% with Helvetica). It and the Kurzweil 7320 were
the only systems that achieved over 84% accuracy (itself unacceptably
low) with a Times Roman font (serifed, proportional).
Although a number of other machines matched the Kurzweil in
speed, their unacceptable accuracy renders this measure of little
value. The 7320 apparently doesn't recognize columns. The article
also spoke highly of its 50,000 word lexicon and "modicum of
artificial intelligence," which recognizes "virtually any font"
and attempts to pick difficult characters based on the text as a
whole. The sheet feeder allows unattended scanning, and the
background operation allows other tasks (such as verification) to
proceed while the scanner works.

HUMANIST, 10/20/88, FROM mmb@jessica.Stanford.EDU: Gives very low
marks to OMNIPAGE: can't read diacritics, can't be trained, no
more accurate than KDM 4000 and often much worse ("possibly
because scanner is lower resolution than KDM"); slower overall
throughput. Good point is its excellent ability to recognize

PC MAGAZINE First Looks, 3/28/89: Omnipage can retain formatting
in the output document, including bold, italic, and even columns.

A phone rep (4/3/89) told me that Omnipage, while not trainable,
will stop and show a character it doesn't recognize, allowing the
operator to input whatever they want. She also said that while
Omnipage would probably see the footnotes as a separate "column,"
they would simply be added to the text after the "first column,"
i.e., the body of the page. That wouldn't help much.

INFOWORLD, 2/27/89: Omnipage will scan either whole pages or let
you specify blocks, columns, or rectangular areas of the page,
and you can "let Omnipage determine the proper order for
recognizing text." (Does this mean that Omnipage will auto-
determine the columns, i.e., could it be easily set up for
distinguishing footnotes at the bottom of the page and saving
them to another file?); slightly less accurate than the Truescan
reviewed in the same article (32 errors in 2500 words, 5 errors in
the 700 word article).
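
For comparison, the error counts quoted from the 2/27/89 INFOWORLD
review can be turned into percentages with a few lines of Python. The
sketch assumes the review counted errors per word, which is consistent
with the 99.6% and 98.7% it reported for Truescan; the function name
word_accuracy is hypothetical.

```python
def word_accuracy(errors, words):
    """Percentage of words recognized without error."""
    return 100 * (1 - errors / words)

# (errors, words) as quoted from the review.
RESULTS = {
    "Truescan, 700 word article":  (3, 700),
    "Truescan, 2500 word article": (32, 2500),
    "Omnipage, 700 word article":  (5, 700),
    "Omnipage, 2500 word article": (32, 2500),
}

for label, (errors, words) in RESULTS.items():
    print(f"{label}: {word_accuracy(errors, words):.1f}%")
```

On these counts the two products differ only on the shorter article
(99.6% against 99.3%); the longer article gives 98.7% for both.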

SUMMARY: According to various reviewers, Kurzweil, Omnipage, and
Truescan all seem to have good accuracy rates; all handle columns, and
all output (more or less) to a variety of file formats. The Kurzweil
appears to retain the edge due to its accuracy, speed, sheet feeder,
and background operation. With the exception of TextPert, NONE of
these systems are trainable-- I have heard rumours for about six
months that the K5000 will soon be trainable, but the folks at
Kurzweil have (usually) refused to even discuss the possibility. At the
moment it looks like TextPert is the best choice, except that at present
it only runs on the Mac (an MS-DOS version is planned). I haven't seen any
reviews or comparisons of it-- does anybody have any experience with
this system??

Any further info would be appreciated. Jamie