17.761 a brief history of humanities computing, 1964-70

From: Humanist Discussion Group (by way of Willard McCarty willard.mccarty@kcl.ac.uk)
Date: Fri May 07 2004 - 16:58:19 EDT


                   Humanist Discussion Group, Vol. 17, No. 761.
           Centre for Computing in the Humanities, King's College London
                       www.kcl.ac.uk/humanities/cch/humanist/
                            www.princeton.edu/humanist/
                         Submit to: humanist@princeton.edu

             Date: Sat, 03 Apr 2004 07:49:59 +0100
             From: Willard McCarty <willard.mccarty@kcl.ac.uk>
             Subject: a brief history of humanities computing, 1964-70

    [Forwarded from Joe Raben, joeraben@cox.net]

    In response to a request from Willard to inform him of the areas of
    concentration that attracted the early computing humanists, I have written
    a brief history of the years 1964-1970. From the perspective of more than a
    generation, this early activity presents several distinctive
    characteristics. Foremost is the pervasive sense expressed by the
    participants that they were venturing into unknown territory with promises
    of both unimagined rewards and dangers. These concerns are apparently
    balanced by a hope that new challenges will produce new satisfactions, that
    the existing types of scholarship require a shaking up which only the
    introduction of new technology can achieve. Perhaps the most valuable
    contribution they made was to establish then, when computers were thought
    of almost exclusively as arithmetic machines, that this new technology
    could also be used for processing words, more complex and elusive than
    numbers. It may be no exaggeration to say that a major part of what
    computers do today in every aspect of information technology owes a debt to
    these hardy individuals who saw the verbal dimension of computing.

    Years before we had organized the Association for Computers and the
    Humanities or the Association for Literary and Linguistic Computing, before
    an Internet facilitated the rapid, simple and relatively free communication
    on which so much of the activity depends today, before Humanist facilitated
    easy linkups with fellows around the globe, before databases of text
    supported wide-ranging and innovative analyses, and obviously before the
    invention of the desktop computer that ties almost every scholar in the
    industrialized world to unimagined computing power, the pioneers whose work
    is discussed here went forth, cautiously but bravely, into territory that
    today has become familiar and largely recognized as legitimate. Some
    knowledge of their accomplishments may serve to guide those who have followed.

    For data on that period, the primary sources are the Author-Subject Index
    to Computers and the Humanities (New York: Pergamon Press, 1967) and Joseph
    Raben, ed., Computer-Assisted Research in the Humanities: A Directory of
    Scholars Active (New York: Pergamon Press, 1977). Additional resources
    include Susan Hockey, A Guide to Computer Applications in the Humanities
    (London: Duckworth and Baltimore: Johns Hopkins, 1980) and Robert L.
    Oakman, Computer Methods for Literary Research (Columbia: Univ. of South
    Carolina Press, 1980; 2nd ed., Athens, Ga.: Univ. of Georgia Press).
    Hockey's book is particularly useful for projects in Britain and Europe;
    Oakman is comparably knowledgeable about those in North America. Unlike
    the self-reported information in the Index and the Directory, however,
    these two surveys attempt to define the investigators' intent, sometimes
    not quite reaching the mark. Discussing my own effort to identify verbal
    associations between Milton and Shelley, for example, Hockey assumed
    (wrongly) that I had used concordances, and Oakman saw the project as
    interesting from a computer science perspective but did not consider the
    literary implications.

    The Directory of Scholars Active is a compilation of data accumulated
    during the first years of CHum, and was only partially designed to become a
    historical record. While its central function was to provide suggestions
    for fellow explorers in the field (instructors in many countries reported
    presenting it to students to stimulate ideas for new projects), many active
    scholars are known to have been neglectful or even unwilling to record
    their work. A substantial number entered information that was later
    recognized to be based more on hope than determination. These two resources
    nevertheless constitute the best record we have of the attitudes and
    expectations of the surprisingly large body of scholars who recognized in
    the primitive computers of that era a potential tool for solving not only
    long-standing problems, such as authorship, but also new problems suggested
    by the machine's capabilities, now made available to those who enjoyed
    access to a university computer or one accessible at some nearby industrial
    or commercial facility.

    Of major consideration in attempting even a cursory survey of this type is
    the important fact that Computers and the Humanities was created to support
    individuals embarking on the then-hazardous exploration of
    computer-assisted humanities research. Those individuals, many of them
    untenured beginners in the academic game of musical chairs, sometimes even
    graduate students seeking permission to submit a computer-assisted
    concordance as a dissertation, ran two risks: first, that they would be
    accused of doing nothing since the computer did all the work or,
    conversely, that they were embarking on expenditures of time and energy
    that would produce no results of any value. The articles published in
    CHum, therefore, and recorded in the Index to the first five volumes often
    represent aspiration and defiance rather than the record of accomplishment
    expected today in scholarly reporting.

    A prime function of the journal (which was intentionally designed to
    resemble any conventional academic journal) was to provide the offprints
    that candidates for academic rewards could deposit on the desks of the
    chairpersons and deans who made these crucial decisions. In establishing
    their credibility, a cardinal rule was impartial and transparent refereeing
    by their peers, who could be located for that purpose through the growing
    files of the Directory. This refereeing activity itself was a mechanism for
    keeping the growing community aware of new developments, even those that
    were not deemed worthy of publication. To clarify the authors’ intentions
    and achievements, much editorial effort had to go into rewriting
    substantial portions of those articles, especially those of writers whose
    native language was not English but who chose to use that medium of
    communication.

    The long-term goal of the two publishing ventures, the Index and the
    Directory (the latter appearing in regular updates in the journal and in
    the book-length accumulation), was to lend further credibility to those
    disparate efforts in North America and Europe and to increase the exchange
    of ideas, procedures and data among that steadily growing band of
    pioneers. The appearance of this information in book format reinforced our
    effort in organizing the international conferences that periodically
    brought members of the community into closer contact. The success of this
    editing and publishing endeavor in generating academic respectability for
    such a totally innovative methodology can be measured in the number of
    individuals known to me whose success in achieving tenure was based on an
    article published in CHum. One of these, involved in the development of
    instructional software, later became the editor of the journal. In the
    tenure decision of another, who concentrated on statistical analysis of
    literature, her chair actually called to find out from me whether "this was
    an authentic journal."

    In both of the primary sources for this report, the Author-Subject Index
    to Computers and the Humanities and Computer-Assisted Research in the
    Humanities: A Directory of Scholars Active, the predominant activity
    recorded is the compilation of concordances. Once IBM had bundled its KWIC
    (KeyWord In Context) program with the mainframes it was selling to
    universities (and other customers), the idea caught on very rapidly that
    this software (the term was very new and catchy at the time) could not only
    create useful indexes of the titles of scientific papers (its prime
    function) but also organize the elements of a demarcated body of text, even
    fiction, drama and poetry. For the first time humanists were able,
    systematically and easily (they hoped), to reorganize large bodies of text
    in nonlinear fashion to reveal aspects of written art that had hitherto
    been essentially hidden.
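
    For readers who have never seen a KWIC listing, the idea can be sketched
    in a few lines of present-day Python. This is only an illustration of the
    principle, not a reconstruction of the IBM program; the function name
    kwic and the width parameter are assumptions made for the example, and a
    working concordance would also attach a citation (work, line number) to
    each entry.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per occurrence of keyword, aligned in a fixed column.

    Hypothetical helper for illustration only: each output line shows up to
    `width` characters of context on either side of the keyword.
    """
    # Collapse all whitespace so line breaks in the source do not split contexts.
    flat = " ".join(text.split())
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()].strip()
        right = flat[m.end():m.end() + width].strip()
        # Right-justify the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines


if __name__ == "__main__":
    sample = "Of Man's first disobedience, and the fruit of that forbidden tree"
    for row in kwic(sample, "the", width=25):
        print(row)
```

    Aligning the keyword in a fixed column is what lets the eye run down the
    page and compare contexts, which is the nonlinear reading described above.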

    Imaginations flew wild. Why not a whole series of concordances to the major
    Victorian British poets? Why not concordances not only to major European
    poets (Goethe, Dante, Shakespeare, Chaucer, Racine, Schiller) but also to
    prose playwrights (Ibsen, Beckett) and other prose writers (Celine,
    Hawthorne, La Rochefoucauld, Livy, Montaigne, Rabelais)? Why not a unified
    concordance to several poets of a limited period, such as the sonneteers of
    the English Renaissance? Why not a multidimensional concordance to Paradise
    Lost, broken down into miniconcordances for each speaker (including the
    narrator), and indications of the locales and other information that might
    illuminate Milton's craft and art? Why not concord all of Milton’s prose?
    Why not concord the vocal texts of Bach’s compositions? Why not a series of
    concordances to all the prose fiction of Faulkner, with a grand
    accumulation at the end, a compendium of a significant artist's total
    literary vocabulary? Why ignore undeciphered Minoan linear texts, Hittite
    cuneiform or Egyptian papyri? Why not a movement to draw attention to
    Russian poets, who were largely engaged in surreptitious anti-government
    propaganda, by publishing concordances to their poetry that the Communist
    publishing houses would never consider?

    Although handmade and conventionally printed concordances had been
    available literally for centuries, their invention dating to the
    Renaissance interest in biblical studies, the new computer-based
    concordances (with each word's immediate environment and a citation) as
    well as verbal indices (just words and citations) provided opportunities
    for expanded and increasingly sophisticated research tools. For the first
    time, computer-generated and -printed concordances could presumably enhance
    the status of their subjects and reveal aspects of their art hitherto
    unrecognizable. A series of concordances to the works of Joseph Conrad as
    described in the Directory suggests some of the broadened scope envisioned:

    >To produce index concordances to the complete corpus of Conrad's prose and
    >to develop statistical profiles of his lexicon for each work and for the
    >corpus as a whole; to demonstrate the usefulness of microfiche for
    >concordances whose bulk would argue against book format publication; to
    >reduce the cost of concordances to users; to prepare books for a
    >computer-collated edition of Conrad.

    Another eager young scholar saw in the computer a means to create a
    research tool of great use to a limited number of colleagues:

    >To recover Cervantes' orthography and provide a KWIC old-spelling
    >concordance to, and a machine-generated old-spelling edition of, his
    >works. The initial concordance body to La
    >Galatea and Don Quixote is complete. The project is now in its second
    >phase. Multiple readings are being singled out and compositorial readings
    >are being replaced by Cervantes' own spellings.

    Other prophets of the new scholarly age foresaw such future advantages as
    parallel concordances to similar works, the addition of "related works" to
    a Bible concordance, and in anticipation of the coming interest in text
    databases, "copies of the paper tapes [being] delivered to the American
    Philological Association's data bank." The likelihood that paper tape
    readers would soon be found only in museums did not, apparently, dampen the
    enthusiasm of this convert to the religion of computer research in the
    humanities.

    The range of subjects and methods is startling and satisfying. Towering
    over all this activity, in both conception and execution, is the Index
    Thomisticus of Roberto Busa, a superconcordance initiated on index cards
    before World War II and consummated in a series of both published books and
    CD-ROMs as the technology provided more and more appropriate sustenance for
    this monumental analytic tool for a major thinker as well as his
    sources and followers. In the academic world, moreover, with much less
    support, other efforts brought out the imaginative and creative energies of
    innovators. To induce the clunky chain printers of the time to produce
    obsolete characters for Anglo-Saxon poetry, to reproduce the undeciphered
    script of the extinct Minoan culture, even to simulate the diacritics of
    French, Italian, Spanish and German plus the minor European languages--all
    these technical problems drew the humanists into realms (and subterranean
    computer centers) where nothing in their training or experience had led
    them before.

    Unlike their tradition-minded colleagues, they had to master a technology
    so new that even its practitioners were largely ignorant of most of it. For the
    first time, humanists were confronted with the technical infrastructure of
    their institutions. Sometimes they were welcomed for providing additional
    breadth to the computer centers' service; at other times and even in the
    same places, they were turned away because their projects placed unusual
    and unacceptable burdens on the facilities for input, processing and
    output. In almost every instance they learned that the glamour of
    computer-assisted literary scholarship was greatly outweighed by the
    drudgery of implementing it.

    Some sense of that technical drudgery facing these early computer humanists
    is reflected in this description of processing punch-card input for Jess
    Bessinger's "Computer-Based Concordance to the Anglo-Saxon Poetic Records":

    >Objective: Exhaustive concordance to all extant Old English verse. Method:
    >1) Edit for hyphenation and otherwise standardize six volumes of text, 2)
    >Keypunch. 3) Transfer to tapes. 4) Sort and alphabetize. 5) Re-edit for
    >homographs, etc. 6) Re-sort, re-alphabetize, and print.

    Anyone who has been tortured by a keypunch, a machine never designed for
    the entry of running text and on which an error--even in the last
    column--required that the entire card be repunched, can appreciate the
    implicit bravery in that simple operation number 2. (Bessinger and a few
    other leaders of that time were already secure in their professorships.
    Their commitment to the new technology may have contributed to its growing
    acceptance and respectability as a legitimate academic activity.)

    A major development of this melding of technology with humanistic interest
    was the new effort to generate concordances to the works of minor figures
    who would not in the past have justified the tedious procedure based on
    hand-copied slips. The numerous entries in the Index and the Directory
    include such lesser literary figures as Yeats, Henry Adams, Kafka, O'Neill,
    Seneca, Baudelaire, the Beowulf poet, Hopkins, Patmore, Sidney, Keats,
    Swift and Wordsworth. In addition, there were expanded developments in
    other significant directions. A new concordance to Shakespeare used the
    spellings of his original publications. Works composed in non-Latin
    alphabets (for example, Russian, Arabic, Hebrew) were transliterated. The
    computer's ability to manipulate text was exploited to produce reverse
    alphabetization where a study of word-terminals was thought to illuminate
    morphological or semantic elements.
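
    Reverse alphabetization is simple to state in modern terms: each word is
    sorted by its spelling read from right to left, so that words sharing an
    ending fall together. The following is a minimal sketch in Python; the
    function name reverse_sort is an assumption for the example, and nothing
    here reflects the programs actually used in the projects reported.

```python
def reverse_sort(words):
    """Sort distinct words by their spelling read right to left.

    Illustrative helper only: words with the same ending (for example a
    shared suffix) end up next to each other in the result.
    """
    # set() drops duplicate entries; w[::-1] is the word spelled backwards.
    return sorted(set(words), key=lambda w: w[::-1])


if __name__ == "__main__":
    print(reverse_sort(["singing", "walked", "ring", "talked", "king"]))
    # prints ['talked', 'walked', 'singing', 'king', 'ring']
```

    The -ed forms and the -ing forms are grouped, which is the arrangement a
    study of word-terminals needs.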

    In particular, the computer's superiority as a counting machine drew
    substantial attention. In 1946, in my own first acquaintance with a
    computer, the ENIAC at the Moore School of the University of Pennsylvania,
    that monster with vacuum tubes was praised to me only for its speed at
    simple arithmetic; it was running a program to calculate rocket
    trajectories, which I was told would have required decades with the
    previously available equipment. Since the machines were being marketed
    then, to industry and commerce as well as universities, as high-speed
    number crunchers, it seemed natural for many compilers of concordances to
    include counts of various components, sometimes the number of each word's
    occurrences and its percentage of the total body of text. Frequency counts
    are reported for such standard authors as Byron, Camus, Corneille, Dante,
    Gide, Roethke and Seneca. Broader efforts encompass Dutch classical
    authors, German texts, Greek and Latin, Italian folklore and poetry,
    Semitic languages and Swedish newspapers.
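
    The kind of count described here, each word's occurrences and its
    percentage of the total text, can be sketched briefly in Python. The
    function name frequency_table is an assumption for the example, and the
    tokenizer is deliberately crude: it recognizes only unaccented Latin
    letters and apostrophes, so the diacritics and non-Latin alphabets that
    troubled the early projects would need separate handling.

```python
import re
from collections import Counter


def frequency_table(text):
    """Return (word, count, percent_of_total) tuples, most frequent first.

    Illustrative only: words are runs of a-z and apostrophes after lowercasing.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # An empty text yields an empty table, so total is never zero below.
    return [(word, n, 100.0 * n / total) for word, n in counts.most_common()]


if __name__ == "__main__":
    for word, n, pct in frequency_table("To be, or not to be: that is the question"):
        print(f"{word:<10}{n:>4}{pct:>8.1f}%")
```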

    This counting function became the basis for efforts to develop a computer
    stylistics. Such concepts were embodied in, for example, a project entitled
    "Statische Linguistiche dell'Italiano." It would incorporate (in part):

    >Testi litterari; dizionari bilingui; periodici e stampa quotidiani;
    >conversazione registrate; testi psicologici e audiometrici, ecc. . . . Si
    >vogliono stabilire le frequenze d'uso delle principali unità (lettere
    >fonemi, sillabe, parole, construtti grammaticali, ecc.) costituenti la
    >lingua italiana ai varii livelli strutturali (grafemico, fonologico,
    >lessicale, morfemico, ecc.).

    [In English: Literary texts; bilingual dictionaries; periodicals and the
    daily press; recorded conversations; psychological and audiometric texts,
    etc. . . . The aim is to establish the frequencies of use of the principal
    units (letters, phonemes, syllables, words, grammatical constructions,
    etc.) that make up the Italian language at its various structural levels
    (graphemic, phonological, lexical, morphemic, etc.).]

    Another program, of more limited scope, is to be designed, in "addition to
    literal counts on explicit language phenomena at the lexemic level, [to] .
    . . store expectations as to certain syntactic and semantic pattern norms.
    Content analysis includes both hand-indexing and automatic recognition."

    An attempt to categorize the stylistic differences and similarities between
    Gerard Manley Hopkins and Dylan Thomas would "count thirty variables (. . .
    phrase and clause types, parts of speech, instances of alliteration, etc.)
    and tabulating; use of multiple correlation coefficients; analysis of
    results." Among other objectifiable elements of literature these
    enthusiasts offered "mean sentence length, relative frequency of certain
    strings of characters or rhetorical figures, etc." Or: "structural patterns
    that vary in scope from the phrase to the entire performance." Or: "word,
    clause, and sentence lengths, a surface-structure grammatical code,
    and vocabulary distribution." Or: "to convert the orthographic to a phonetic
    text and scan for patterns on assonance, alliteration, rime, and other
    features associated with pre-semantic verse patterning. Also to consider
    larger structures on the semantic level such as collocation of verbal
    imagery."
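
    Two of the simpler measures named in these reports, mean sentence length
    and mean word length, can be sketched as follows. This is a rough
    illustration in Python under stated assumptions: the function name
    style_profile is invented for the example, sentences are split naively at
    periods, exclamation marks and question marks, and words are runs of
    unaccented letters and apostrophes. Abbreviations, quotations and verse
    lineation would all require more careful treatment.

```python
import re


def style_profile(text):
    """Return mean sentence length (in words) and mean word length (in letters).

    Illustrative only; returns None when the text has no sentences or words.
    """
    # Naive segmentation: any run of . ! ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    return {
        "mean_sentence_length": len(words) / len(sentences),
        "mean_word_length": sum(len(w) for w in words) / len(words),
    }


if __name__ == "__main__":
    print(style_profile("I came. I saw. I conquered!"))
```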

    A more specialized area of this effort is authorship attribution. Most
    appropriately, in view of the concordance's early invention as a mechanism
    to resolve problems with biblical texts, several projects took aim at the
    traditional questions of who wrote the Pentateuch and whether the Book of
    Isaiah had one or several authors. At the Technion--Israel Institute of
    Technology, this second problem was considered in the light of such
    criteria as "sentence length, word length, frequencies of parts of speech,
    text entropy, 'special vocabulary,' facultative particles, vocabulary
    eccentricity and richness, etc." From Utah came a report of a project which
    ambitiously performed "statistical comparisons, based on several hundred
    stylistic variables consisting of approximately seventy types of literary
    elements, . . . between the book of Isaiah and other books from the Old
    Testament. Inter-text variation was compared with intra-text variation
    using distribution-free methods of statistical analysis to avoid mistakes
    commonly made by style researchers who have used statistical procedures."
    Neither report mentions whether the texts used were in Hebrew or English.
    Similar efforts were made in connection with the authorship of the Odyssey.

    Other authorship studies focused on Homer and on Diderot's Encyclopedie.
    Occasional projects concentrated on the mechanics of computer sorting.
    Philip H. Smith, Lewis Sawin and Susan Hockey produced concordance programs
    that were intended for wider use; Hockey's was commercially published by
    the Oxford University Press. Four listed endeavors, one each for German and
    Old French, and two for Dutch, sorted their entries in reverse, in order to
    group suffixes for further analysis. Several major projects used
    concordances as the basis for dictionaries. Notable among these are
    historical or specialized dictionaries, such as the Trésor de la langue
    française, the French rival to the British OED; the Historical Dictionary
    of the Italian Language of the Accademia della Crusca; the Dictionary of
    American Regional English; and various more specialized dictionaries of
    Old Spanish, Old Scots, and Old Dutch, plus the intellectual lexicon of
    Europe. The ancient problem of collating texts, particularly those that
    had been hand-copied before the invention of printing, was the goal of
    several attempts.

    Related activities pushed the boundaries of "humanistic" computing. Many
    were linguistically oriented, especially those that could more properly be
    classified within realms that are not usually considered humanistic.
    Associations of this type could, in my judgment, only enrich the
    disciplines at both ends of the intellectual exchange. A concordance to
    Calvin or Freud would introduce a new component into the study of those
    texts, while humanists might well learn new methods and insights by seeing
    their new tool applied to unfamiliar materials. (The Freud concordance, for
    example, raised the question of whether it was appropriate to use an
    English translation as the base text. Bruno Bettelheim had pointed out
    that, for instance, there would be a significant distinction between the
    responses of German readers who saw Mädchen as neuter, represented by the
    pronoun es, and English readers who saw girl as she.) A major question, new
    to humanistic studies, was whether samples (standard in more scientistic
    research) could be valid in the analysis of literary texts.

    Perhaps of greatest interest, now that we contemplate that scene from the
    perspective of almost two generations past, is the number of significantly
    missing elements. The leadership of major academic institutions, for
    example, is strikingly absent. Except for an occasional association with
    Cambridge or Chicago, the overwhelming majority of reporting scholars are
    affiliated with less-prestigious universities. This phenomenon may be the
    topic of some future study. Government support, per se, also seems to be
    notable by its absence. My own efforts to identify
    computer-related projects financed by the National Endowment for the
    Humanities bore fruit only during a brief period when one staff member was
    induced to search the records and inform me; her departure from the NEH
    ended any interest there in publicizing such work. No systematic
    encouragement of computer research, on the national or state level, appears
    in the record.

    Most unfortunate, again from my own perspective, is the absence from all
    these reports of the name of Theodor H. Nelson and his epoch-making
    concept of hypertext. His recognition that all texts are linked to all
    others in an almost infinite web of associations--verbal, thematic,
    imagistic--has served many of the commercially successful search
    programs that attract increasing millions to the computer as an information
    retrieval tool, but humanists seem to have been (and perhaps still may be)
    ignorant of his contribution. For this, Nelson himself is partly to blame,
    since he records no effort to reach out to the humanities community.
    Ironically, he seems to have driven often past Queens College, where for 20
    years I edited and published Computers and the Humanities, without ever
    knowing that our activity would have benefited from knowledge of his work.
    The one time I heard him lecture, in Los Angeles, he chose not to refer,
    even in passing, to his own theories. Perhaps now, as humanities computing
    has gained acceptance throughout the academic community, with humanities
    support personnel attached to most computer centers, more attention will be
    paid to the theoretical infrastructure that must strengthen the forthcoming
    melding of technological text manipulation with humanistic understanding.

    A major residue of these early activities, even those that did not progress
    past the early stages, is the large quantity of machine-readable text
    accumulated and sometimes preserved. While there are disappointing stories
    of investigators returning from leave to discover that their laboriously
    input texts had been erased, there are complementary tales of large
    databases designed from the start and distributed as research tools for
    computer-oriented humanists. The Brown University Standard Corpus of
    Edited American English was compiled by Henry Kucera and W. Nelson Francis
    with the specific intention that it serve as a measure for other frequency
    studies. It became the model for other corpora of greater magnitude and
    sophistication. Theodore Brunner conceived and executed the design of the
    Thesaurus Linguae Graecae to provide an exhaustive library of texts for the
    language that was the lingua franca of the entire Mediterranean basin
    until the end of the sixth century C.E. On that pattern, David W. Packard
    is directing a parallel effort for Latin. Similar projects have been
    designed for Chinese and Modern English.

    Two of these pioneering efforts sought to organize information about
    resource materials for humanistic research. One encompassed "structured
    abstracts describing Bodleian Library medieval manuscript photograph
    holdings (18,000 color transparencies) and retrieval of catalog and fifteen
    separate indices, including three devoted to iconography." An especially
    interesting project is the London Stage Data Bank, an early computerization
    of an extant multivolume compilation of the programs from the four licensed
    London theaters during the 140 years after they were reopened when the
    Cromwell interregnum ended. Various indices and the then-novel online
    access provided opportunities to trace the careers of both actors and
    playwrights in a manner that could support close understanding of the
    social and economic factors controlling this major institution. Under
    Ben Schneider's supervision, this project grew into one of the very first
    efforts to create an online database for research on the history of the
    popular entertainment that both reflected and generated much of the
    intellectual ambiance of that period of English history. Schneider's
    account of his effort in Adventures in Computerland constitutes a prime
    resource for those interested in tracing the history of our subdiscipline.

    As for the projects that fell along the way, the reasons for their demise
    are several and predictable. For many early enthusiasts, the actuality
    sometimes proved too burdensome, especially when the output generated
    seemed spawned by a sorcerer's apprentice who could not be
    stopped. Often basic questions could not be answered satisfactorily, such
    as how to encode nonstandard materials, or whether to reduce texts to their
    lemmatized forms or process them in their original aspects. The absence of
    dependable optical character recognition machines required laborious input,
    a burden on scholars without grants or input skills. Lack of support and
    even overt antagonism from administrators (I was advised once by my
    department chair to switch from the English department to computer science,
    "where [I] belonged") must have dampened the ardor of many insecure,
    untenured young faculty. The rapid succession of new computers, which would
    not run the older software, as well as the deaths of programming languages
    like SNOBOL and PL/I, would also prove discouraging, while industry was
    eager to recruit, at much better salaries, many people with even a
    smattering of computer expertise.

    Since few investigators have been likely to report failure or abandonment
    of their projects, it would be very useful for anyone with information on
    these and other projects of the period to report them to Humanist.
    Likewise, information on projects not included in the Index or the
    Directory would be helpful in filling out our picture of that period. Such
    information can be sent to me at joeraben@cox.net and will be included in
    any updates of this report that are justified by new information.

                [Note: If you do not receive a reply within 24 hours please
    resend.]
    Dr Willard McCarty | Senior Lecturer | Centre for Computing in the
    Humanities | King's College London | Strand | London WC2R 2LS || +44 (0)20
    7848-2784 fax: -2980 || willard.mccarty@kcl.ac.uk
    www.kcl.ac.uk/humanities/cch/wlm/



    This archive was generated by hypermail 2b30 : Fri May 07 2004 - 16:58:41 EDT