Humanist Discussion Group, Vol. 17, No. 761.
Centre for Computing in the Humanities, King's College London
www.kcl.ac.uk/humanities/cch/humanist/
www.princeton.edu/humanist/
Submit to: humanist@princeton.edu
Date: Sat, 03 Apr 2004 07:49:59 +0100
From: Willard McCarty <willard.mccarty@kcl.ac.uk>
Subject: a brief history of humanities computing, 1964-70
[Forwarded from Joe Raben, joeraben@cox.net]
In response to a request from Willard to inform him of the areas of
concentration that attracted the early computing humanists, I have written
a brief history of the years 1964-1970. From the perspective of more than a
generation, this early activity presents several distinctive
characteristics. Foremost is the pervasive sense expressed by the
participants that they were venturing into unknown territory with promises
of both unimagined rewards and dangers. These concerns were balanced by a
hope that new challenges would produce new satisfactions, and that the
existing types of scholarship required a shaking-up which only the
introduction of new technology could achieve. Perhaps the most valuable
contribution they made was to establish then, when computers were thought
of almost exclusively as arithmetic machines, that this new technology
could also be used for processing words, more complex and elusive than
numbers. It may be no exaggeration to say that a major part of what
computers do today in every aspect of information technology owes a debt to
these hardy individuals who saw the verbal dimension of computing.
Years before we had organized the Association for Computers and the
Humanities or the Association for Literary and Linguistic Computing, before
an Internet facilitated the rapid, simple and relatively free communication
on which so much of the activity depends today, before Humanist facilitated
easy linkups with fellows around the globe, before databases of text
supported wide-ranging and innovative analyses, and obviously before the
invention of the desktop computer that ties almost every scholar in the
industrialized world to unimagined computing power, the pioneers whose work
is discussed here went forth, cautiously but bravely, into territory that
today has become familiar and largely recognized as legitimate. Some
knowledge of their accomplishments may serve to guide those who have followed.
For data on that period, the primary sources are the Author-Subject Index
to Computers and the Humanities (New York: Pergamon Press, 1967) and Joseph
Raben, ed., Computer-Assisted Research in the Humanities: A Directory of
Scholars Active (New York: Pergamon Press, 1977). Additional resources
include Susan Hockey, A Guide to Computer Applications in the Humanities
(London: Duckworth and Baltimore: Johns Hopkins, 1980) and Robert L.
Oakman, Computer Methods for Literary Research (Columbia: Univ. of South
Carolina Press, 1980; 2nd ed., Athens, Ga.: Univ. of Georgia Press).
Hockey's book is particularly useful for projects in Britain and Europe;
Oakman is comparably knowledgeable about those in North America. Unlike
the self-reported information in the Index and the Directory, however,
these two surveys attempt to define the investigators' intent, sometimes
not quite reaching the mark. Discussing my own effort to identify verbal
associations between Milton and Shelley, for example, Hockey assumed
(wrongly) that I had used concordances, and Oakman saw the project as
interesting from a computer science perspective but did not consider the
literary implications.
The Directory of Scholars Active is a compilation of data accumulated
during the first years of CHum, and was only partially designed to become a
historical record. While its central function was to provide suggestions
for fellow explorers in the field (instructors in many countries reported
presenting it to students to stimulate ideas for new projects), many active
scholars are known to have been neglectful or even unwilling to record
their work. A substantial number entered information that was later
recognized to be based more on hope than determination. These two resources
nevertheless constitute the best record we have of the attitudes and
expectations of the surprisingly large body of scholars who recognized in
the primitive computers of that era a potential tool for solving not only
long-standing problems, such as authorship, but also new problems suggested
by the machine's capabilities, now made available to those who enjoyed
access to a university computer or one accessible at some nearby industrial
or commercial facility.
Of major consideration in attempting even a cursory survey of this type is
the fact that Computers and the Humanities was created to support
individuals embarking on the then-hazardous exploration of
computer-assisted humanities research. Those individuals, many of them
untenured beginners in the academic game of musical chairs, sometimes even
graduate students seeking permission to submit a computer-assisted
concordance as a dissertation, ran two risks: first, that they would be
accused of doing nothing, since the computer did all the work; second,
that they were embarking on expenditures of time and energy
that would produce no results of any value. The articles published in
CHum, therefore, and recorded in the Index to the first five volumes often
represent aspiration and defiance rather than the record of accomplishment
expected today in scholarly reporting.
A prime function of the journal (which was intentionally designed to
resemble any conventional academic journal) was to provide the offprints
that candidates for academic rewards could deposit on the desks of the
chairpersons and deans who made these crucial decisions. In establishing
the authors' credibility, a cardinal rule was impartial and transparent
refereeing by their peers, who could be located for that purpose through the growing
files of the Directory. This refereeing activity itself was a mechanism for
keeping the growing community aware of new developments, even those that
were not deemed worthy of publication. To clarify the authors' intentions
and achievements, much editorial effort had to go into rewriting
substantial portions of those articles, especially those of writers whose
native language was not English but who chose to use that medium of
communication.
The long-term goal of the two publishing ventures, the Index and the
Directory (the latter appearing in regular updates in the journal and in
the book-length accumulation), was to lend further credibility to those
disparate efforts in North America and Europe and to increase the exchange
of ideas, procedures and data among that steadily growing band of
pioneers. The appearance of this information in book format reinforced our
effort in organizing the international conferences that periodically
brought members of the community into closer contact. The success of this
editing and publishing endeavor in generating academic respectability for
such a totally innovative methodology can be measured in the number of
individuals known to me whose success in achieving tenure was based on an
article published in CHum. One of these, involved in the development of
instructional software, later became the editor of the journal. In the
tenure decision of another, who concentrated on statistical analysis of
literature, her chair actually called to find out from me whether "this was
an authentic journal."
In both of the primary sources for this report, the Author-Subject Index
to Computers and the Humanities and Computer-Assisted Research in the
Humanities: A Directory of Scholars Active, the predominant activity
recorded is the compilation of concordances. Once IBM had bundled its KWIC
(KeyWord In Context) program with the mainframes it was selling to
universities (and other customers), the idea caught on very rapidly that
this software (the term was very new and catchy at the time) could not only
create useful indexes of the titles of scientific papers (its prime
function) but also organize the elements of a demarcated body of text, even
fiction, drama and poetry. For the first time humanists were able,
systematically and easily (they hoped) to reorganize large bodies of text
in nonlinear fashion to reveal aspects of written art that had hitherto
been essentially hidden.
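The core of the KWIC idea is easily sketched in a modern language. The
following minimal Python fragment is offered only as an illustration of the
technique; the function name and parameters are invented here, and the
original programs of course ran in mainframe languages of the period:

    import re
    from collections import defaultdict

    def kwic_concordance(text, width=30):
        # Build a minimal KeyWord-In-Context listing: every word is
        # collected together with its immediate left and right context,
        # the nonlinear reorganization of text described above.
        concordance = defaultdict(list)
        for match in re.finditer(r"[A-Za-z']+", text):
            word = match.group().lower()
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            concordance[word].append(f"{left:>{width}} {match.group()} {right}")
        return concordance

    sample = "Of Man's first disobedience, and the fruit of that forbidden tree"
    for line in kwic_concordance(sample)["of"]:
        print(line)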
Imaginations ran wild. Why not a whole series of concordances to the major
Victorian British poets? Why not concordances not only to major European
poets (Goethe, Dante, Shakespeare, Chaucer, Racine, Schiller) but also to
prose playwrights (Ibsen, Beckett) and other prose writers (Celine,
Hawthorne, La Rochefoucauld, Livy, Montaigne, Rabelais)? Why not a unified
concordance to several poets of a limited period, such as the sonneteers of
the English Renaissance? Why not a multidimensional concordance to Paradise
Lost, broken down into miniconcordances for each speaker (including the
narrator), and indications of the locales and other information that might
illuminate Milton's craft and art? Why not concord all of Milton’s prose?
Why not concord the vocal texts of Bach’s compositions? Why not a series of
concordances to all the prose fiction of Faulkner, with a grand
accumulation at the end, a compendium of a significant artist's total
literary vocabulary? Why ignore undeciphered Minoan linear texts, Hittite
cuneiform or Egyptian papyri? Why not a movement to draw attention to
Russian poets, who were largely engaged in surreptitious anti-government
propaganda, by publishing concordances to their poetry that the Communist
publishing houses would never consider?
Although handmade and conventionally printed concordances had been
available literally for centuries, their invention dating to the
Renaissance interest in biblical studies, the new computer-based
concordances (with each word's immediate environment and a citation) as
well as verbal indices (just words and citations) provided opportunities
for expanded and increasingly sophisticated research tools. For the first
time, computer-generated and -printed concordances could presumably enhance
the status of their subjects and reveal aspects of their art hitherto
unrecognizable. A series of concordances to the works of Joseph Conrad as
described in the Directory suggests some of the broadened scope envisioned:
To produce index concordances to the complete corpus of Conrad's prose and
to develop statistical profiles of his lexicon for each work and for the
corpus as a whole; to demonstrate the usefulness of microfiche for
concordances whose bulk would argue against book format publication; to
reduce the cost of concordances to users; to prepare books for a
computer-collated edition of Conrad.
Another eager young scholar saw in the computer a means to create a
research tool of great use to a limited number of colleagues:
>To recover Cervantes' orthography and provide a KWIC old-
>spelling concordance to, and a machine-generated
>old-spelling edition of, his works. The initial concordance body to La
>Galatea and Don Quixote is complete. The project is now in its second
>phase. Multiple readings are being singled out and compositorial readings
>are being replaced by Cervantes' own spellings.
Other prophets of the new scholarly age foresaw such future advantages as
parallel concordances to similar works, the addition of "related works" to
a Bible concordance, and in anticipation of the coming interest in text
databases, "copies of the paper tapes [being] delivered to the American
Philological Association's data bank." The likelihood that paper tape
readers would soon be found only in museums did not, apparently, dampen the
enthusiasm of this convert to the religion of computer research in the
humanities.
The range of subjects and methods is startling and satisfying. Towering
over all this activity, in both conception and execution, is the Index
Thomisticus of Roberto Busa, a superconcordance initiated on index cards
before World War II and consummated in a series of both published books and
CD-ROMs as the technology provided more and more appropriate sustenance for
this monumental analytic tool for a major thinker as well as his
sources and followers. In the academic world, moreover, with much less
support, other efforts brought out the imaginative and creative energies of
innovators. To induce the clunky chain printers of the time to produce
obsolete characters for Anglo-Saxon poetry, to reproduce the undeciphered
script of the extinct Minoan culture, even to simulate the diacritics of
French, Italian, Spanish and German plus the minor European languages--all
these technical problems drew the humanists into realms (and subterranean
computer centers) where nothing in their training or experience had led
them before.
Unlike their tradition-minded colleagues, they had to master a technology
so new that even its practitioners were largely ignorant of most of it. For the
first time, humanists were confronted with the technical infrastructure of
their institutions. Sometimes they were welcomed for providing additional
breadth to the computer centers' service; at other times and even in the
same places, they were turned away because their projects placed unusual
and unacceptable burdens on the facilities for input, processing and
output. In almost every instance they learned that the glamour of
computer-assisted literary scholarship was greatly outweighed by the
drudgery of implementing it.
Some sense of that technical drudgery facing these early computer humanists
is reflected in this description of processing punch-card input for Jess
Bessinger's "Computer-Based Concordance to the Anglo-Saxon Poetic Records":
>Objective: Exhaustive concordance to all extant Old English verse. Method:
>1) Edit for hyphenation and otherwise standardize six volumes of text, 2)
>Keypunch. 3) Transfer to tapes. 4) Sort and alphabetize. 5) Re-edit for
>homographs, etc. 6) Re-sort, re-alphabetize, and print.
Anyone who has been tortured by a keypunch, a machine never designed for
the entry of running text and on which an error--even in the last
column--required that the entire card be repunched, can appreciate the
implicit bravery in that simple operation number 2. (Bessinger and a few
other leaders of that time were already secure in their professorships.
Their commitment to the new technology may have contributed to its growing
acceptance and respectability as a legitimate academic activity.)
A major development of this melding of technology with humanistic interest
was the new effort to generate concordances to the works of minor figures
who would not in the past have justified the tedious procedure based on
hand-copied slips. The numerous entries in the Index and the Directory
include such lesser literary figures as Yeats, Henry Adams, Kafka, O'Neill,
Seneca, Baudelaire, the Beowulf poet, Hopkins, Patmore, Sidney, Keats,
Swift and Wordsworth. In addition, there were expanded developments in
other significant directions. A new concordance to Shakespeare used the
spellings of his original publications. Works composed in non-Latin
alphabets (for example, Russian, Arabic, Hebrew) were transliterated. The
computer's ability to manipulate text was exploited to produce reverse
alphabetization where a study of word-terminals was thought to illuminate
morphological or semantic elements.
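In modern terms, reverse alphabetization amounts to nothing more than
sorting on the reversed spelling of each word, so that shared endings fall
together; a minimal sketch, with an invented word list:

    # Sorting on the reversed spelling groups shared suffixes:
    # the -ing, -tion and -ly words cluster together in the output.
    words = ["singing", "running", "nation", "station", "quickly", "softly"]
    for word in sorted(words, key=lambda w: w[::-1]):
        print(word)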
In particular, the computer's superiority as a counting machine drew
substantial attention. In 1946, in my own first acquaintance with a
computer, the ENIAC at the Moore School of the University of Pennsylvania,
that monster with vacuum tubes was praised to me only for its speed at
simple arithmetic; it was running a program to calculate rocket
trajectories, which I was told would have required decades with the
previously available equipment. Since the machines were being marketed
then, to industry and commerce as well as universities, as high-speed
number crunchers, it seemed natural for many compilers of concordances to
include counts of various components, sometimes the number of each word's
occurrences and its percentage of the total body of text. Frequency counts
are reported for such standard authors as Byron, Camus, Corneille, Dante,
Gide, Roethke and Seneca. Broader efforts encompass Dutch classical
authors, German texts, Greek and Latin, Italian folklore and poetry,
Semitic languages and Swedish newspapers.
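The frequency tables these compilers appended are trivially reproduced
today; a minimal sketch (the sample sentence is a mere placeholder):

    import re
    from collections import Counter

    def frequency_table(text):
        # Count each word's occurrences and its percentage of the
        # total running words, as the early compilers often did.
        words = re.findall(r"[A-Za-z']+", text.lower())
        counts = Counter(words)
        for word, n in counts.most_common():
            print(f"{word:<12}{n:>5}{n / len(words):>9.2%}")

    frequency_table("the quick brown fox jumps over the lazy dog the fox")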
This counting function became the basis for efforts to develop a computer
stylistics. Such concepts were embodied in, for example, a project entitled
"Statische Linguistiche dell'Italiano." It would incorporate (in part):
>Testi litterari; dizionari bilingui; periodici e stampa quotidiani;
>conversazione registrate; testi psicologici e audiometrici, ecc. . . . Si
>vogliono stabilire le frequenze d'uso delle principali unità (lettere
>fonemi, sillabe, parole, construtti grammaticali, ecc.) costituenti la
>lingua italiana ai varii livelli strutturali (grafemico, fonologico,
>lessicale, morfemico, ecc.).
Another program, of more limited scope, was to be designed, in "addition to
literal counts on explicit language phenomena at the lexemic level, [to] . .
. store expectations as to certain syntactic and semantic pattern norms.
Content analysis includes both hand-indexing and automatic recognition."
An attempt to categorize the stylistic differences and similarities between
Gerard Manley Hopkins and Dylan Thomas would "count thirty variables (. . .
phrase and clause types, parts of speech, instances of alliteration, etc.)
and tabulating; use of multiple correlation coefficients; analysis of
results." Among other objectifiable elements of literature these
enthusiasts offered "mean sentence length, relative frequency of certain
strings of characters or rhetorical figures, etc." Or: "structural patterns
that vary in scope from the phrase to the entire performance." Or: "word,
clause, and sentence lengths, a surface-structure grammatical code,
and vocabulary distribution." Or: "to convert the orthographic to a phonetic
text and scan for patterns on assonance, alliteration, rime, and other
features associated with pre-semantic verse patterning. Also to consider
larger structures on the semantic level such as collocation of verbal
imagery."
A more specialized area of this effort is authorship attribution. Most
appropriately, in view of the concordance's early invention as a mechanism
to resolve problems with biblical texts, several projects took aim at the
traditional questions of who wrote the Pentateuch and whether the Book of
Isaiah had one or several authors. At the Technion--Israel Institute of
Technology, this second problem was considered in the light of such
criteria as "sentence length, word length, frequencies of parts of speech,
text entropy, 'special vocabulary,' facultative particles, vocabulary
eccentricity and richness, etc." From Utah came a report of a project which
ambitiously performed "statistical comparisons, based on several hundred
stylistic variables consisting of approximately seventy types of literary
elements, . . . between the book of Isaiah and other books from the Old
Testament. Inter-text variation was compared with intra-text variation
using distribution-free methods of statistical analysis to avoid mistakes
commonly made by style researchers who have used statistical procedures."
Neither report mentions whether the texts used were in Hebrew or English.
Similar efforts were made in connection with the authorship of the Odyssey.
Other authorship studies focused on Homer and on Diderot's Encyclopédie.
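A toy version of such a stylistic profile can be sketched as follows; the
three variables are a tiny subset of the several hundred mentioned above,
and the sample texts are mere placeholders:

    import re

    def style_profile(text):
        # A few of the simple variables counted in early attribution
        # studies: mean sentence length (in words), mean word length
        # (in characters) and vocabulary richness (type-token ratio).
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text.lower())
        return {
            "mean_sentence_len": len(words) / len(sentences),
            "mean_word_len": sum(map(len, words)) / len(words),
            "type_token_ratio": len(set(words)) / len(words),
        }

    text_a = "Comfort ye, comfort ye my people. Speak ye comfortably."
    text_b = "Arise, shine. Thy light is come. The glory is risen upon thee."
    for name, text in (("A", text_a), ("B", text_b)):
        print(name, style_profile(text))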
Occasional projects concentrated on the mechanics of computer sorting.
Philip H. Smith, Lewis Sawin and Susan Hockey produced concordance programs
that were intended for wider use; Hockey's was commercially published by
the Oxford University Press. Four listed endeavors, one each for German and
Old French, and two for Dutch, sorted their entries in reverse, in order to
group suffixes for further analysis. Several major projects used
concordances as the basis for dictionaries. Notable among these are
historical or specialized dictionaries, such as the Trésor de la langue
française, the French rival to the British OED; the Historical Dictionary
of the Italian Language of the Accademia della Crusca; the Dictionary of
American Regional English; and various more specialized dictionaries of
Old Spanish, Old Scots, and Old Dutch, plus the intellectual lexicon of
Europe. The ancient problem of collating texts, particularly those that
had been hand-copied before the invention of printing, was the goal of
several attempts.
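The heart of mechanical collation, aligning two witnesses and reporting
where they diverge, can be sketched with the Python standard library; the
witness readings below are invented for illustration:

    import difflib

    witness_a = "whan that aprill with his shoures soote".split()
    witness_b = "whan that aprille with hise shoures sote".split()

    # Align the two witnesses word by word and report the variants,
    # the basic operation of a collation program.
    matcher = difflib.SequenceMatcher(None, witness_a, witness_b)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            print(tag, witness_a[a1:a2], "->", witness_b[b1:b2])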
Related activities pushed the boundaries of "humanistic" computing. Many
were linguistically oriented, especially those that could more properly be
classified within realms that are not usually considered humanistic.
Associations of this type could, in my judgment, only enrich the
disciplines at both ends of the intellectual exchange. A concordance to
Calvin or Freud would introduce a new component into the study of those
texts, while humanists might well learn new methods and insights by seeing
their new tool applied to unfamiliar materials. (The Freud concordance, for
example, raised the question of whether it was appropriate to use an
English translation as the base text. Bruno Bettelheim had pointed out
that, for instance, there would be a significant distinction between the
responses of German readers who saw Mädchen as neuter, represented by the
pronoun es, and English readers who saw girl as she.) A major question, new
to humanistic studies, was whether samples (standard in more scientistic
research) could be valid in the analysis of literary texts.
Perhaps of greatest interest, now that we contemplate that scene from the
perspective of almost two generations past, is the number of conspicuously
missing elements. The leadership of major academic institutions, for
example, is strikingly absent. Except for an occasional association with
Cambridge or Chicago, the overwhelming majority of reporting scholars are
affiliated with less-prestigious universities. This phenomenon may be the
topic of some future study. Government support, per se, also seems to be
notable by its absence. My own efforts to identify
computer-related projects financed by the National Endowment for the
Humanities bore fruit only during a brief period when one staff member was
induced to search the records and inform me; her departure from the NEH
ended any interest there in publicizing such work. No systematic
encouragement of computer research, on the national or state level, appears
in the record.
Most unfortunate, again from my own perspective, is the absence from all
these reports of the name of Theodor H. Nelson and his epoch-making
concept of hypertext. His recognition that all texts are linked to all
others in an almost infinite web of associations--verbal, thematic,
imagistic--has served as the basis for many of the commercially successful
search programs that attract increasing millions to the computer as an information
retrieval tool, but humanists seem to have been (and perhaps still may be)
ignorant of his contribution. For this, Nelson himself is partly to blame,
since he records no effort to reach out to the humanities community.
Ironically, he seems to have driven often past Queens College, where for 20
years I edited and published Computers and the Humanities, without ever
knowing that our activity would have benefited from knowledge of his work.
The one time I heard him lecture, in Los Angeles, he chose not to refer,
even in passing, to his own theories. Perhaps now, as humanities computing
has gained acceptance throughout the academic community, with humanities
support personnel attached to most computer centers, more attention will be
paid to the theoretical infrastructure that must strengthen the forthcoming
melding of technological text manipulation with humanistic understanding.
A major residue of these early activities, even those that did not progress
past the early stages, is the large quantity of machine-readable text
accumulated and sometimes preserved. While there are disappointing stories
of investigators returning from leave to discover that their laboriously
input texts had been erased, there are complementary tales of large
databases designed from the start and distributed as research tools for
computer-oriented humanists. The Brown University Standard Corpus of
Edited American English was compiled by Henry Kucera and W. Nelson Francis
with the specific intention that it serve as a measure for other frequency
studies. It became the model for other corpora of greater magnitude and
sophistication. Theodore Brunner conceived and executed the design of the
Thesaurus Linguae Graecae to provide an exhaustive library of texts for the
language that was the lingua franca of the entire Mediterranean basin
until the end of the sixth century C.E. On that pattern, David W. Packard
is directing a parallel effort for Latin. Similar projects have been
designed for Chinese and Modern English.
Two of these pioneering efforts sought to organize information about
resource materials for humanistic research. One encompassed "structured
abstracts describing Bodleian Library medieval manuscript photograph
holdings (18,000 color transparencies) and retrieval of catalog and fifteen
separate indices, including three devoted to iconography." An especially
interesting project is the London Stage Data Bank, an early computerization
of an extant multivolume compilation of the programs from the four licensed
London theaters during the 140 years after they were reopened when the
Cromwell interregnum ended. Various indices and the then-novel online
access provided opportunities to trace the careers of both actors and
playwrights in a manner that could support close understanding of the
social and economic factors controlling this major institution. Under
Ben Schneider's supervision, this project grew into one of the very first
efforts to create an online database for research on the history of the
popular entertainment that both reflected and generated much of the
intellectual ambiance of that period of English history. Schneider's
account of his effort in Adventures in Computerland constitutes a prime
resource for those interested in tracing the history of our subdiscipline.
As for the projects that fell along the way, the reasons for their demise
are several and predictable. For many early enthusiasts, the actuality
sometimes proved too burdensome, especially when the generated output
seemed spawned by a sorcerer's apprentice who could not be
stopped. Often basic questions could not be answered satisfactorily, such
as how to encode nonstandard materials, or whether to reduce texts to their
lemmatized forms or process them in their original aspects. The absence of
dependable optical character recognition machines required laborious input,
a burden on scholars without grants or input skills. Lack of support and
even overt antagonism from administrators (I was advised once by my
department chair to switch from the English department to computer science,
"where [I] belonged") must have dampened the ardor of many insecure,
untenured young faculty. The rapid succession of new computers, which would
not run the older software, as well as the deaths of programming languages
like SNOBOL and PL/I, would also prove discouraging, while industry was
eager to recruit, at much better salaries, many people with even a
smattering of computer expertise.
Since few investigators have been likely to report failure or abandonment
of their projects, it would be very useful for anyone with information on
these and other projects of the period to report them to Humanist.
Likewise, information on projects not included in the Index or the
Directory would be helpful in filling out our picture of that period. Such
information can be sent to me at joeraben@cox.net and will be included in
any updates of this report that are justified by new information.
[Note: If you do not receive a reply within 24 hours please
resend.]
Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
Humanities | King's College London | Strand | London WC2R 2LS || +44 (0)20
7848-2784 fax: -2980 || willard.mccarty@kcl.ac.uk
www.kcl.ac.uk/humanities/cch/wlm/