20.204 a Latin treebank

Date: Wed, 20 Sep 2006

         From: Ross Scaife
         Subject: Latin treebank

A message received yesterday from David Bamman at Perseus and now
making the rounds of various lists:

     The Perseus Project has recently received a planning grant from
the NSF to investigate the costs and labor involved in constructing a
multimillion-word Latin treebank, along with its potential value for
the linguistics and Classics community. While our initial efforts
under this grant will focus on syntactically annotating excerpts from
Golden Age authors (Caesar, Cicero, Vergil) and the Vulgate, a future
multimillion-word corpus would comprise writings from the
pre-Classical period up through the Early Modern era. To date we've
annotated a total of 12,000 words in a style that's predominantly
informed by two sources: the dependency grammar used by the Prague
Dependency Treebank (itself based on Mel'čuk 1988), and the Latin
grammar of Pinkster 1990.

     While treebanks provide valuable training data for computational
tasks such as grammar induction and automatic syntactic parsing, they
also have the potential to be used in traditional research areas that
Classicists in particular are poised to exploit. Large collections of
syntactically parsed sentences have the potential to revolutionize
lexicography and philology, as they provide the immediate context for
a word's use along with its typical syntactic arguments (this lets us
chart, for example, how the meaning of a verb changes as its
predominant arguments change). Treebanks enable large-scale research
into structurally based rhetorical devices of particular interest to
Classicists (such as hyperbaton), and they provide the raw data for
research in historical linguistics (such as the move in Latin from
classical SOV word order to Romance SVO).

     The eventual Latin treebank will be openly available to the
public; we should, therefore, come to a consensus on how it should be
built. To that end we encourage input from the linguistics and
Classics community on the treebank design (including the syntactic
representation of Latin) and welcome contributions by annotators (for
which limited funding is available). Interested collaborators should
contact David Bamman (David.Bamman_at_tufts.edu) at the Perseus Project.
