Date: Thu, 9 Jul 1992 17:04 EDT
Subject: Text Projects in the Humanities

In response to Professor Maurizio Lana's inquiry about information resources
for electronic text projects in the humanities, the following is some
background on the Georgetown Center for Text and Technology's
catalogue of projects. I plan to send Dr. Lana a copy of our list of
electronic text projects, and I invite any other interested persons to
contact me at MFRIEDMAN@guvax.georgetown.edu, either with information on
projects or to receive a copy of the list.

Since April of 1989, the Center for Text & Technology (CTT), under
the aegis of the Academic Computer Center at Georgetown
University, has been compiling a catalogue of projects that create
and analyze electronic text in the humanities. The Georgetown
University Catalogue of Projects in Electronic Text is a powerful
database that includes information on electronic text projects
throughout the world. The database includes a variety of
information on the many collections of literary works, historical
documents, and linguistic data which are available from commercial
vendors and scholarly sources. The database is written in Ingres
and resides on a VAX 8700 computer at Georgetown University. The
database may be searched by off-campus users who can connect to
the database using Telnet or a modem.

The electronic text projects documented in the database are
machine-readable files of primary materials from humanities
disciplines. Whether entered by keyboarding or by scanning with
an optical character reader, these text files generally take the
form either of large corpora for linguistic analysis (such as the
new British National Corpus of 100 million words currently being
developed by Oxford University Press and others) or major works of
major authors for analysis of style and content (such as the
compact disc of the Thesaurus Linguae Graecae containing 1400
years of classical Greek texts). The catalogue does not include
electronic versions of encyclopedias, dictionaries, or secondary
studies, nor does it include concordances, databases, or
computer-assisted instruction programs that do not contain full-text
versions of primary works, as these materials are beyond the scope of
this catalogue.

Unlike the databases that research libraries often make available,
the electronic texts cataloged at Georgetown are intended by their
developers to be searched and manipulated directly by humanists.
Often, therefore, the text is encoded with a markup language to
facilitate integration with other files; occasionally, the texts
are combined with a commercial text-analysis tool such as
WordCruncher, Folio Views, or Micro-OCP.

With electronic text and integrated analysis software, the
researcher not only has the equivalent of an interactive
concordance for finding instances of key words but can also search
for clusters of words, exact phrases, and co-occurrences of key
words (combined with Boolean operators) in contexts of various sizes.
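
As a rough illustration of the kind of co-occurrence search described
above, here is a minimal Python sketch. It is not the method used by
WordCruncher, Folio Views, or Micro-OCP; the function names
(tokenize, cooccurrences, context) and the window parameter are
assumptions made for this example only.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cooccurrences(tokens, first, second, window=5):
    """Return (i, j) index pairs where `second` occurs within
    `window` tokens of an occurrence of `first`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == second:
                hits.append((i, j))
    return hits

def context(tokens, i, width=3):
    """Return the words around position i, like one concordance line."""
    return " ".join(tokens[max(0, i - width): i + width + 1])

# A single line from Macbeth serves as the sample text.
sample = "Come what come may, time and the hour runs through the roughest day"
tokens = tokenize(sample)

# 'time' and 'hour' co-occur within three words, but not within two.
near = cooccurrences(tokens, "time", "hour", window=3)
far = cooccurrences(tokens, "time", "hour", window=2)
line = context(tokens, 4, width=2)
```

Widening or narrowing the window corresponds to searching "in
contexts of various sizes"; a real text-analysis package would add
indexing so that large files need not be scanned on every query.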

Statistical programs show where the desired term or concept is
concentrated in a work or series of works, and parsing programs
can analyze parts of speech and syntactic structures.
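
One simple way to see where a term is concentrated is to compare its
rate of occurrence per thousand words across several works. The
sketch below assumes that measure; the function name
rate_per_thousand and the two short placeholder texts are invented
for the example and are not data from the catalogue.

```python
import re
from collections import Counter

def rate_per_thousand(text, term):
    """Occurrences of `term` per 1,000 word tokens in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * Counter(tokens)[term] / len(tokens)

# Placeholder texts standing in for full works (hypothetical).
works = {
    "work_a": "time after time the time came",
    "work_b": "no such word appears in this short text here",
}

# Compute the rate for each work, then rank works by concentration.
rates = {title: rate_per_thousand(text, "time")
         for title, text in works.items()}
ranked = sorted(rates, key=rates.get, reverse=True)
```

Normalizing by length matters because a long work will contain more
raw occurrences of almost any word than a short one.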

In general, therefore, the combination of electronic text and
searching software can be said to provide the researcher with both
microscopic and macroscopic views of the text. The former
provides access to small-scale features of a single work; for
example, within seconds, a philosopher could locate the single
occurrence of the phrase "consciousness of absolute being" from
the nine-megabyte, three-volume translation of Hegel's Lectures on
the Philosophy of Religion. By contrast, the macroscopic view of
the text highlights the ways in which one work differs from other
works by the same author or the author's contemporaries; for
example, if one searches an eleven-megabyte file of Shakespeare's
works for the word 'time,' one finds a greater concentration in
Macbeth than in the other tragedies, and by exploring the
contexts, one can see how the title character's over-reaching can
be explained thematically in terms of his attempt to usurp the
providential function that belongs to time.

Given these advantages, it is not surprising that the conversion
of primary texts to electronic form is proliferating throughout
the world. Nevertheless, because of the unpublicized academic
nature of such projects, the process of locating them can be
difficult. For this reason, we rely heavily on the discussion
groups on BITNET and Internet, not only to identify new projects
but also to request information about them and to disseminate the
material we compile. Electronic mail provides access to the most
recent developments and permits us to receive and transmit
information throughout the world quickly and economically. Among
the sixty discussion groups we monitor are those in language
and literature (Ansax-L, C18-L, Chaucer, Contex-L, English,
Ficino, Linguist, Litera-L, Literary, Reed-L, Rustex-L, Shakesper,
and Wwp-L), culture and religion (Ccnet-L, Indology, Japan,
Judaica, and Religion), libraries (Cdrom-L, Fisc-L, Libref-L,
Pacs-L, and Tei-L), philosophy and history (History, Philos-L, and
Philosop), and the humanities in general (Erl-L, Gutnberg,
Humanist, and Pmc-Talk).

In our search for news of projects, we also review a wide range of
publications, including popular magazines and newspapers (such as
the Chronicle of Higher Education), agency reports (such as the
List of Awards of the National Endowment for the Humanities),
trade publications (including InfoWorld and EDU Magazine),
discipline-specific journals (such as Computers and Philosophy and
Computers and the Classics), the newsletters of numerous academic
computing centers, and the journals central to humanities
computing (Computers and the Humanities, Bits and Bytes Review,
and the ICAME Journal).

Once we have identified a new project, we request ten categories
of information:

0. Identifying acronym or short reference;

1. Name and affiliation of operation (including collaborators)
with references to any published description;

2. Contact person and/or vendor with addresses;

3. Primary disciplinary focus (and secondary interests);

4. Focus: time period, geographical area, or individual;

5. Language(s) coded;

6. Intended use(s) and Size (number of works, or entries, or

7. File format(s);

8. Form(s) of access (online, tape, diskette, CD-ROM, etc.);

9. Source(s) of the archival holdings: encoded in-house, or
obtained from elsewhere.
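
The ten categories above could be modelled as a single record per
project. The following Python sketch is only an illustration: the
catalogue itself is an Ingres database, and the class name
ProjectRecord, the field names, the find_by_language helper, and the
sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProjectRecord:
    """One catalogue entry; comments give the category number."""
    short_reference: str                                     # 0
    operation: str                                           # 1
    contact: str                                             # 2
    primary_discipline: str                                  # 3
    focus: str                                               # 4
    languages: List[str] = field(default_factory=list)       # 5
    intended_uses: List[str] = field(default_factory=list)   # 6
    size: str = ""                                           # 6
    file_formats: List[str] = field(default_factory=list)    # 7
    access_forms: List[str] = field(default_factory=list)    # 8
    source_of_holdings: str = ""                             # 9

def find_by_language(records, language):
    """Return the records that list `language`, ignoring case."""
    wanted = language.lower()
    return [r for r in records
            if wanted in (lang.lower() for lang in r.languages)]

# A made-up entry with placeholder values (not a real project).
example = ProjectRecord(
    short_reference="EXAMPLE",
    operation="Example University text archive",
    contact="contact@example.edu",
    primary_discipline="Literature",
    focus="placeholder period",
    languages=["Greek", "Latin"],
    access_forms=["tape", "CD-ROM"],
)

matches = find_by_language([example], "greek")
```

Keeping list-valued categories (languages, formats, forms of access)
as lists makes searches such as "all projects encoding Greek"
straightforward.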

Because the catalogue is constantly being updated, any printing
would be almost immediately obsolete. Consequently, the CTT has
converted the catalogue to an online database searchable through
Telnet and dial-in access so that current information can be
made available to researchers. In addition, searches of the
catalogue are performed on request, and updated lists of projects
and addresses are posted regularly on the HUMANIST electronic
bulletin board and distributed through surface and electronic mail.

For further information about the project, or to request a
specific search, please contact:

Margaret Friedman, Project Assistant
The Center for Text and Technology
Academic Computer Center
238 Reiss Science Building
Georgetown University
Washington, DC 20057

(202) 687-6096
BITNET: mfriedman@guvax Internet: mfriedman@guvax.georgetown.edu