3.397 Kurzweil 5100 scanner (114)

Sat, 26 Aug 89 17:28:56 EDT

Humanist Discussion Group, Vol. 3, No. 397. Saturday, 26 Aug 1989.

Date: Sat, 26 Aug 89 01:30:18 CST
From: "Robin C. Cover" <ZRCC1001@SMUVM1>

Kurzweil 5100 Scanner

Today I spent three hours playing with the new Kurzweil 5100
scanner, and want to report cautious optimism. Actually, I'm trying to
provoke Terry Erdt to fulfill his promise (HUMANIST #3.21) to provide
a thorough review of the Kurzweil 5100, Accutext and Innovatic

The Kurzweil 5100 was impressive from the standpoint that it
performed a task other scanners have not achieved in our tests: it
could be trained to correctly read pointed Hebrew "characters" (text
samples taken from the Brown-Driver-Briggs lexicon) and allowed us
to map the composite characters to upper-ascii positions or to unique
(flagged) four-character string sequences. Other scanners we've
tested are not so trainable, or would not track properly with
subscripted and superscripted fonts of this complexity. We used the
Kurzweil 4000 to scan a Greek-English lexicon (BAGD), but the 4000
will not handle pointed Hebrew very well. Other tests we conducted
with the Chicago Assyrian Dictionary and similar multilingual lexica
suggest that the Kurzweil 5100 holds promise for scanning multilingual
texts. At $18,000, it costs about the same as three years of service
contract on the model 4000. (Will that convince our administration?
Anyone want to buy a cheap 4000?)

The Kurzweil 5100 scans at 400 dpi, but will also process text
scanned and imported from other sources (TIFF, PCX, IGF, RES). It
will output processed text to (18) popular text or word-processing
formats. Since processing is passed from the scanner to an IBM-PC
(equipped with a board), the user has greater flexibility in configuring
the system for batch operation and overall performance. One could
scan during the day and process the scanned files at night. By
contrast, the Kurzweil 4000 is limited by the fact that all hardware is
internal to the unit, and there is no upgrade path. The 5100 supports
most features of the 4000, but has software and user-interface
improvements that should make fast work of scanning (automatic
column recognition; auto page location [makes the "tablet"
unnecessary]; "omnifont" recognition; duplex page collation; one-pass
text-and-graphics scanning; recognition of point sizes 6-24; etc.).

The critical point of interest for academic institutions (and
HUMANISTS) will probably be the extent to which the Kurzweil 5100
can be pressed into service for multilingual scanning. I did not receive
satisfactory answers to all my questions from the local support team.
The scanner is said to support (8) modern "languages," but I think this
has little relevance for our work: only one "language" may be selected
in a single scanning operation. The 5100 features automatic
"omnifont" recognition, which means it employs generalized feature
extraction algorithms and is not constrained by the limitation of the
Kurzweil 4000, which permitted a maximum of ten discrete fonts. We
have found that the 4000's ten "fonts" (which we map to languages
*and* print styles) are not enough for complex documents. During the
training session on the 5100, unrecognized "characters" can be
mapped to special user-defined names up to (4) characters long. If
this means all 256 ASCII chars, with no practical limit (doubtful), the
new Kurzweil provides more composite- or special-character positions
than I'd ever care to train (millions!). But key assignment for several
languages might not be difficult to manage: we could use a 1-character
language code and 1-to-3-character mnemonic or transliteration
sequence for the new character name.
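
A rough sketch of the arithmetic and of the naming scheme just described
may help. It assumes, only for illustration, that names are built from
letters and digits; the real limits of the 5100's training software are
exactly what I could not find out. The function names and the sample
name "Hbq" are hypothetical. If all 256 byte values really were allowed
in each of four positions, the count comes to about 4.3 billion, which
is well past "millions."

```python
import string

# Assumed alphabet for character names: ASCII letters and digits (62 symbols).
# This is an illustrative assumption, not a documented limit of the scanner.
ALPHABET = string.ascii_letters + string.digits


def name_capacity(alphabet_size: int, length: int = 4) -> int:
    """Count the distinct names of exactly `length` characters."""
    return alphabet_size ** length


def names_per_language(alphabet_size: int) -> int:
    """Count the 1-, 2-, and 3-character mnemonics under one language code."""
    return sum(alphabet_size ** n for n in (1, 2, 3))


def make_name(lang_code: str, mnemonic: str) -> str:
    """Build a name: 1-character language code + 1-to-3-character mnemonic."""
    if len(lang_code) != 1:
        raise ValueError("language code must be exactly one character")
    if not 1 <= len(mnemonic) <= 3:
        raise ValueError("mnemonic must be 1 to 3 characters")
    return lang_code + mnemonic


if __name__ == "__main__":
    # Upper bound if every one of 256 byte values were permitted per position.
    print(name_capacity(256))
    # Names available to a single language under the assumed alphabet.
    print(names_per_language(len(ALPHABET)))
    # Hypothetical name: "H" for Hebrew plus a two-letter mnemonic.
    print(make_name("H", "bq"))
```

Even under the restricted alphabet, one language code leaves far more
names than any lexicon would need, so the open question is the
software's actual ceiling, not the scheme.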

The header of the raw ascii output file contains information about
font assignments, tabs, print styles, column dimensions and so forth.
It's probably not meant for humans to look at (just programmers), but it
can be read. It looks probable, and seems reasonable, that the data in
the header and associated output files can be combined unambiguously
with the flags *inside* the text file to yield customized markup for
multilingual text. I can't say more, but I'm praying.

While I should not recommend purchase of a Kurzweil 5100 based
solely on this limited testing, I certainly would recommend giving it
further evaluation. Institutions looking for a top-of-the-line scanner or
intelligent OCR software don't have much choice, though, since the
Kurzweil 4000 is no longer available. Or...is Optiram a serious
contender? The 4000 was a dead-end anyway: we uncovered
software problems that Kurzweil would not (or "could" not) fix in the
arcane spaghetti code. By contrast, the OCR software for the 5100 is
written in maintainable C-code, and improvements are already being
made. If anyone (Terry Erdt?) can obtain detailed and definitive
answers to questions about multilingual applications [esp., the number
of permissible 4-character assignments for "special characters"], I will
be grateful. Perhaps we can form a lobby to encourage Kurzweil to
support additional features in the OCR software and output formats.
If anyone is using the 5100 for multilingual applications, please
let us hear your comments.

Intelligent scanning technologies, it seems to me, hold the only
realistic promise for retrieving our paper-bound literary heritage.
Keyboarding works, of course, but it's too expensive. While we await
the results of the Text Encoding Initiative to provide guidelines for
descriptive markup of our texts, and mature authoring software for
structured-editing, the technological bottleneck in data preparation is
still scanning and intelligent character recognition. The Kurzweil 5100
appears to take us one step further toward that goal.

Professor Robin Cover

(bitnet preferable)           Dallas Seminary
zrcc1001@smuvm1.BITNET        3909 Swiss Avenue
attctc!utafll!robin.UUCP      Dallas, TX 75204
attctc!cdword!cover.UUCP      (214) 296-1783(h) 824-3094(w)