6.0615 Corpus Processing Software available (1/90)

Tue, 23 Mar 1993 18:08:30 EST

Humanist Discussion Group, Vol. 6, No. 0615. Tuesday, 23 Mar 1993.

Date: Tue, 23 Mar 1993 16:11:03 +0100
From: Knut Hofland <Knut.Hofland@hd.uib.no>
Subject: LEXA: Corpus processing software

Lexa, a set of programs for lexical data processing, written by Raymond Hickey,
is now available from the Norwegian Computing Centre for the Humanities for
about 100 USD.

The programs run under MS-DOS and come on 4 diskettes with a manual
of 750 pages in 3 volumes. To get more information and an order form, send
the following line to FILESERV@HD.UIB.NO:

send icame lexa.info

This file can also be fetched with FTP or Gopher from nora.hd.uib.no
in the catalogue icame.

Knut Hofland
Norwegian Computing Centre for the Humanities,
Harald Haarfagres gt. 31, N-5007 Bergen, Norway
Phone: +47 5 212954/5/6, Fax: +47 5 322656, E-mail: knut@x400.hd.uib.no

Here is a short description of the programs, written by the author.


Raymond Hickey,
English Department,
University of Munich,

Lexical Data Processing

The present set of programmes is intended to offer a wide range of
software which will carry out (i) the lexical analysis and (ii)
information retrieval tasks required by linguists involved in the
investigation of text corpora. The suite has been particularly adapted
to be used with the corpus of historical English compiled at the
University of Helsinki. The general nature of the software, however,
permits its application to any set of texts, particularly those which
are arranged in the so-called Cocoa format.

Lexical analysis.

The main programme, Lexa, puts at the disposal of the interested
linguist the options he or she would require in order to process
lexical data with a high degree of automation on a personal computer.
The set is divided into several groups which perform typical functions.
Of these the first, lexical analysis, will be of immediate concern.
Lexa allows one, via tagging, to lemmatise any text or series of texts
with a minimum of effort. All that is required is that the user specify
what (possible) words are to be assigned to what lemmas. The rest is
taken care of by the programme. In addition, one can create frequency
lists of the types and tokens occurring in any loaded text, make
lexical density tables, and transfer textual data in a user-defined
manner to a database environment, to mention just some of the procedures
which are built into Lexa. The results of all operations are stored as files
and can be examined later, for instance with the text editor shipped
with the package. Each item of information used by Lexa when
manipulating texts is specifiable by means of a setup file which is
loaded after calling Lexa and used to initialise the programme in the
manner desired by the user.
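
For readers unfamiliar with this kind of processing, the idea of
lemmatising from a user-supplied list and then counting types and tokens
can be sketched in a few lines of Python. This is an illustration only and
not Lexa's own code or file format: the form-to-lemma table, the function
names and the sample sentence are all invented here, and the type/token
ratio is just one simple measure, not necessarily the density figure that
Lexa computes.

```python
import re
from collections import Counter

# Hypothetical form-to-lemma table. Lexa lets the user specify such
# assignments, but its actual format is not described in this posting.
LEMMAS = {"goeth": "go", "went": "go", "kyng": "king", "kynges": "king"}

def tokenize(text):
    """Split a text into lower-case alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lemmatise(tokens, table):
    """Replace each token by its lemma; unlisted forms are kept as they are."""
    return [table.get(tok, tok) for tok in tokens]

def frequency_list(tokens):
    """Return (item, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def type_token_ratio(tokens):
    """Number of distinct items divided by the number of running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "The kyng went and the kynges men goeth"
lemmas = lemmatise(tokenize(sample), LEMMAS)
for item, count in frequency_list(lemmas):
    print(f"{item:10}{count}")
print("type/token ratio:", type_token_ratio(lemmas))
```

The table is the only part the user has to supply; everything else runs
without further intervention, which is the division of labour described
above.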

Information retrieval.

The second main goal of the Lexa set is to offer flexible and efficient
means of retrieving information from text corpora. The programme Lexa
Pat allows one to specify a whole range of parameters for combing
through text files. By determining these precisely, the user can achieve
a high proportion of correct returns, which is of value when evaluating
texts quantitatively. A further programme, Lexa DbPat, permits similar
retrieval operations to be applied to databases, for instance those
generated by Lexa from text files of a corpus.

Ascertaining the occurrence of syntactic contexts is catered for by the
programme Lexa Context with which users can specify search strings,
their position in a sentence, the number of intervening items and then
comb through any set of texts in search of them.
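
A search of this kind (two strings within one sentence, separated by no
more than a given number of intervening items) might look roughly as
follows in Python. The sketch is not taken from Lexa Context; the function
names, the max_gap parameter and the deliberately naive sentence splitter
are assumptions made for the sake of the example.

```python
import re

def sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def find_context(text, first, second, max_gap=2):
    """Find `first` followed by `second` within the same sentence.

    Returns (sentence_no, pos_first, pos_second) triples, all zero-based,
    for every pair with at most `max_gap` words in between.
    """
    hits = []
    for n, sent in enumerate(sentences(text)):
        words = re.findall(r"[a-z']+", sent.lower())
        for i, word in enumerate(words):
            if word != first:
                continue
            # Positions i+1 .. i+max_gap+1 leave 0 .. max_gap words between.
            for j in range(i + 1, min(i + max_gap + 2, len(words))):
                if words[j] == second:
                    hits.append((n, i, j))
    return hits

text = "He has never been there. She has been here."
print(find_context(text, "has", "been", max_gap=1))
```

Because the word positions are returned along with the sentence number, a
caller could go on to restrict hits to, say, sentence-initial position.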

By means of the utility Cocoa it is possible to group the text files of
a corpus on the basis of shared parameters in the Cocoa-format header
found at the beginning of each file in many text collections, e.g. the
Helsinki corpus. All information retrieval operations can then have as
their scope those files grouped on the basis of their contents by the
Cocoa utility.
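
The grouping principle can be illustrated with a short Python sketch. It
assumes that a header consists of leading lines of the form <CODE value>,
which is the usual shape of Cocoa references; the function names, file
names, codes and values below are made up for the example and do not
reproduce the Cocoa utility or any actual corpus file.

```python
import re
from collections import defaultdict

# One header line: an angle-bracketed code followed by its value.
COCOA_LINE = re.compile(r"<(\w+)\s+([^>]*)>")

def read_header(text):
    """Collect code/value pairs from the leading <CODE value> lines."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue            # tolerate blank lines
        match = COCOA_LINE.fullmatch(line)
        if not match:
            break               # first line of running text ends the header
        header[match.group(1)] = match.group(2).strip()
    return header

def group_by(files, code):
    """Map each value of `code` to the names of the files that carry it."""
    groups = defaultdict(list)
    for name, text in files.items():
        groups[read_header(text).get(code)].append(name)
    return dict(groups)

# Invented miniature "corpus": file name -> file contents.
files = {
    "a.txt": "<A AUTHOR ONE>\n<C M3>\nFirst line of running text.",
    "b.txt": "<A AUTHOR TWO>\n<C M3>\nFirst line of running text.",
    "c.txt": "<A AUTHOR TWO>\n<C M4>\nFirst line of running text.",
}
print(group_by(files, "C"))
```

Any retrieval routine can then be run over one of the resulting lists of
file names rather than over the whole corpus.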

In the design of the current suite of programmes, flexibility has been
given the highest priority. This is to be seen in the number of items, in
nearly all programmes, which can be determined by the user.
Furthermore, techniques have been employed which render the structure
of each programme as user-friendly as possible (pull-down menus, window
technology, mouse support, similarity of command structure between the
40-odd programmes of the set), permitting the linguist to concentrate
on essentially linguistic matters.