3.1061 archives

Willard McCarty (MCCARTY@vm.epas.utoronto.ca)
Thu, 15 Feb 90 20:17:47 EST

Humanist Discussion Group, Vol. 3, No. 1061. Thursday, 15 Feb 1990.

Date: Wed, 14 Feb 90 21:32:04 -0800
From: edwards%cogsci.Berkeley.EDU@jade.berkeley.edu (Jane Edwards)
Subject: transcripts on computer

[The following query is from someone about to join Humanist. Please post
all replies both to her directly and to Humanist. --W.M.]

Jane Edwards
edwards@ucbcogsc.bitnet
I am preparing a list of computerized archives of spoken and written language
data and would like to ask your help in making it more complete. Below I
summarize the ones I know about; if you know of others, I would like very
much to hear from you.

The largest archive of language data available on computer medium seems to be
the Oxford Text Archive, which lists about 450 separate collections of written
or spoken language, including those of 4 other archives:
U. of Cambridge, U. of Pisa, U. of Pennsylvania, and Brigham Young U.
Oxford Text Archive email address: archive%vax.ox.ac.uk@ukacrl.earn (BITNET),
archive%vax.ox.ac.uk@ucl.cs.edu (EDU), archive@uk.ac.ox.vax (JANET).
Most of the Oxford holdings are written works (such as literary classics
and Biblical works), but the archive also has several well-known spoken language
corpora (described below). Most are in English, but the following languages
are also represented: Arabic, Armenian, Coptic, Danish, Dutch, Finnish,
French, Fulfulde, Gaelic, German, Greek, Hebrew, Icelandic, Italian, Kurdish,
Latin, Latvian, Malayan, Mayan, Pali, Portuguese, Provençal, Russian,
Sanskrit, Serbo-Croat, Spanish, Swedish, Turkish, and Welsh.

Of Oxford's WRITTEN ENGLISH corpora, the best known may be the BROWN CORPUS,
composed of 500 written language samples of 2000 words each from a range of
written styles of English printed in 1961 (described in Kucera & Francis,
1967, _Computational analysis of present-day American English_). This corpus
is not currently widely used in Linguistics (though perhaps it is in Literature
or the Humanities) because the data are (a) from written rather than spoken
sources, and (b) 30 years old. The large "Australian Corpus Project"
(described in Kyto, et al. (eds.), 1988, _Corpus linguistics: hard and soft_,
and in the book review in _Language_, 1989, 65(4), 843-848), may provide a
needed updated sampling of a wide range of written (Australian/British)
English, and will also include some spoken English.

The best known corpus for SPOKEN ADULT BRITISH English is probably the
London-Lund corpus (described in Svartvik & Quirk, 1980, _A corpus of spoken
English_, and Svartvik, et al., 1982, _Survey of Spoken English_), available
through the Oxford Text Archive. These data include conversations by people
of various ages, occupations, etc., recorded under various circumstances, and
have rich prosodic marking. Another large archive of spoken (British) English
is the Lancaster-Oslo-Bergen (LOB) archive (52,000 words in length, sampled
to be as close to RP as possible, rich prosodic marking), also available
through the Oxford Text Archive. The Collins-Birmingham archive includes
among other things the complete transcript of the 18-month-long inquiry into
the plan for constructing the Sizewell nuclear power station, used for the 1987
"COBUILD" ("Collins Birmingham University International Language Database")
English language dictionary.

There is as yet no archive of the size of London-Lund for SPOKEN ADULT
AMERICAN English discourse. At Berkeley, we have a collection of various
types of spoken interaction (from conversations, to the Oliver North trial,
to lectures), mostly contributed by professors here and their students. The
CHILDES archive at Carnegie-Mellon (Brian MacWhinney, brian@andrew.cmu.edu),
in addition to its child language data, distributes the written and spoken
language corpora from the CORNELL project. The spoken samples range from
abortion debates to the Patty Hearst trial to TV sitcoms. The
ethnomethodology hotline (reachable through the ComServe fileserver, or
support@rpiecs.bitnet) is sometimes used for exchange of transcripts, though
it is not an official archive. An enormous archive is currently in the planning
stages at UC Santa Barbara, to meet the need for large-scale sampling of
discourse types, situations, etc.

Concerning PHONETICS data bases, the DARPA Speech Recognition Research
Database consists of phonetic transcriptions of sentences read aloud by
American adults from various parts of the country. Digitized versions
are also available. (See W. M. Fisher, G. R. Doddington, and K. M.
Goudie-Marshall, _Proceedings of the Speech Recognition Workshop_, Feb. 1986,
Defense Advanced Research Projects Agency, Information Processing Techniques
Office report number AD-A165 977.) A similar data base may exist at the
Advanced Telecommunications Research Institute International in Osaka, Japan
(details unknown).

The 1987 Linguistic Society of America questionnaire turned up many transcript
data sets, but relatively few of them were on computer. The trend toward
computerizing transcripts is growing, and with it discussion of standards,
normalization, etc.; as that happens, more of them may come into common domain.

In Germany, I know of two large archives (are there others?). One is in
Mannheim and contains various types of data in the German language (obtainable
through the Oxford Text Archive). The other is at Univ. of Ulm (designed and
coordinated by Erhard Mergenthaler, LU07@DMARUM8.bitnet, author of _Textbank
systems: Computer science applied in the field of psychoanalysis_, 1985), and
contains a large number of psychotherapy sessions and interviews (most in
monolingual German, some in monolingual English).

In the Netherlands (Max-Planck-Institut fuer Psycholinguistik, Nijmegen,
Helmut Feldweg, helmut@hnympi51.bitnet), there is the European Science
Foundation Second Language Data Bank, containing transcripts of 10 groups
of adult migrant workers learning the language of their country of residence
(e.g., Turks learning German or Dutch, Punjabis learning English,
Moroccans learning French, and Spaniards and Finns learning Swedish),
including several types of language use, gathered systematically in
three data cycles over the course of 2 1/2 years.

So, these are all of the ones that I know about. If you know of others,
or have additional information concerning those mentioned above,
I would very much appreciate hearing from you.

Jane Edwards