9.328 programming

Humanist (mccarty@phoenix.Princeton.EDU)
Tue, 28 Nov 1995 21:33:58 -0500 (EST)

Humanist Discussion Group, Vol. 9, No. 328.
Center for Electronic Texts in the Humanities (Princeton/Rutgers)

[1] From: Nancy Ide <ide@univ-aix.fr> (105)
Subject: programming -- some (old) ideas

The recent discussion on programming has brought to mind some of the
arguments Jean Veronis and I made in a short 1993 paper, a version of
which appeared in the ACH Newsletter (Winter, 1993). Our ideas
(and experience!) have evolved considerably since then, mostly in
connection with our work in a European project that is developing text
software. However, parts of the article seem worth repeating as input to
the current discussion.

Realizing that not all of you have your 1993 copies of the ACH Newsletter
on hand, I append here an excerpt from our earlier article as additional
input to the discussion, for what it is worth!

Despite some changes in our view, the basic architecture of the tools we
are developing follows fairly closely on the approach outlined here. So
far, it seems to live up to most of the claims we made two years ago. Time
will tell the full story!

Nancy Ide



Existing text-analytic software (including the small body of
publicly marketed software as well as privately developed
software) consists mainly of integrated systems: each is one
large, complex piece of code that handles all the
functionality. We believe that this approach to the design of
text-analytic software is inappropriate for the needs of the
research community.

First, the change in the scale of usable text data, from the
order of a few million words to tens or hundreds of millions,
is likely to affect both what language researchers do and how
they do it. It is difficult to foresee the functions that
text analysis will require in the future. We are certainly at
the beginning of an era of exploration and experimentation,
which will require flexible software that can be adapted to
one-time or new applications. The functionality of large
integrated systems is usually fixed and unmodifiable.

Second, large integrated systems require long periods for
their development, which means that new software is not going
to appear overnight. In addition, development costs are very
high, which prohibits free (or cheap) distribution. Low
software cost is especially important in a field where many
researchers have very little grant money or institutional
funds for software purchase.

Finally, large integrated systems either provide the
functions you need or force you to write your own program
from scratch in a low-level language such as C. For many
researchers without formal computer science training, and
with limited development resources, writing large programs
from scratch is impractical. Even for those who can, there is
often much duplication of effort. It is not at all uncommon
for tailor-made systems to replicate much of the
functionality of similar systems, and in turn to produce
programs that cannot be re-used by others, and so on in an
endless cycle of software waste. The reusability of data is a
much-discussed topic these days; similarly, we need "software
reusability" to avoid the reinvention of the wheel that has
characterized much language-analytic research over the past
three decades.


We believe that the research community would be best served
by a set of small tools that scholars can use alone or
combine to create larger, more complex programs. By small, we
mean very, very small--often on the order of a few lines of
code, with the absolute minimum of functionality. Functions
can be "bundled" to perform more complex tasks, as needed. In
this way, increasingly complex program bundles can be
developed without the overhead of large-system design, and
with ease of modification, since any bundle can be de-bundled
into its constituent programs, each a small, easily
understandable piece of code. The use of small
functions allows scholars to concentrate on the underlying
logic in order to develop new functions, and eliminates the
need for the training and/or time to implement algorithms at
lower levels. The development of increasingly complex bundles
provides good, reliable software for scholars who want a
ready-made high-level tool.
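To make the bundling idea concrete, here is a minimal sketch using
standard operating-system commands. The function name "wordfreq" and the
sample input are illustrative assumptions, not from the original article:
a word-frequency lister bundled from tiny, independently testable stages.

```shell
# A hypothetical "bundle": a word-frequency lister composed of small
# functions, each a standard command doing one minimal job.
wordfreq() {
  tr -cs '[:alpha:]' '\n' |      # split input into one word per line
  tr '[:upper:]' '[:lower:]' |   # normalize case
  sort |                         # group identical words together
  uniq -c |                      # count duplicates
  sort -rn                       # most frequent words first
}

printf 'The cat sat on the mat\n' | wordfreq
```

De-bundling is equally simple: any single stage can be run, tested, or
replaced on its own without touching the rest.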

For example, consider the task of computing tag collocation
tables (tables indicating how many times a given tag was
followed by another) for a corpus tagged for part of speech.
Trying to do this with a single program demands complex data
structures and data-access methods (hash tables, etc.) and
constitutes a substantial programming project. However
(assuming a two-column, word-tag format for the corpus), the
task can be accomplished by combining four simple functions
(extract the tag column, output tag/next-tag pairs for the
whole text, sort the pairs, count duplicates), and the
programming effort becomes trivial--so trivial that three of
the steps are standard commands in many operating systems.
The only "complicated" step is outputting the tag/next-tag
pairs, which can be written in fewer than five lines in any
programming language and easily debugged and tested.
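Under the stated assumption of a two-column, tab-separated word/tag
format (the file name corpus.tagged and the tiny sample corpus are
illustrative), the whole task reduces to a sketch like the following,
in which the one-line awk program is the single "complicated" step:

```shell
# A tiny illustrative corpus in the assumed word<TAB>tag format.
printf 'the\tDET\ncat\tNN\nsat\tVB\nthe\tDET\ndog\tNN\n' > corpus.tagged

cut -f2 corpus.tagged |                          # 1. extract the tag column
awk 'NR > 1 { print prev, $1 } { prev = $1 }' |  # 2. output tag/next-tag pairs
sort |                                           # 3. sort the pairs
uniq -c                                          # 4. count duplicates
```

Each stage can be inspected or swapped independently; replacing
uniq -c with a different counter, say, touches nothing else in the
pipeline.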


The paradigm we have outlined has clear advantages for the
development of text-analytic software. Most importantly, it
enables a *distributed* software development effort to which
anyone can contribute. Since modules are extremely small,
individuals with small systems are just as capable of
contributing as large teams. The distribution of effort also
means that there is no large investment as with integrated
systems, which in turn means that software can be distributed
cheaply, or for free. In addition, this approach enables what
we call "software evolution". There is no need to envision
the entire functionality of the system at the outset;
instead, because extension is trivial (and can re-use
existing functions or bundles), the system can grow as and
when demand arises. This is especially important
in a field where special and one-time use tools are common,
and new functions are likely to be desired.