Humanist Discussion Group, Vol. 15, No. 487.
Centre for Computing in the Humanities, King's College London
<http://www.princeton.edu/~mccarty/humanist/>
<http://www.kcl.ac.uk/humanities/cch/humanist/>
[1] From: Angela Mattiacci <amattiac@uottawa.ca> (32)
Subject: Canadian Century Research Infrastructure
[2] From: Magali Duclaux <duclaux@elda.fr> (133)
Subject: ELRA News
--[1]------------------------------------------------------------------
Date: Tue, 05 Feb 2002 08:10:04 +0000
From: Angela Mattiacci <amattiac@uottawa.ca>
Subject: Canadian Century Research Infrastructure
OTTAWA, January 31, 2002 - The Canadian Century Research Infrastructure
(CCRI), a pan-Canadian research project, will benefit from the Canada
Foundation for Innovation's latest round of funding.
Industry Minister Allan Rock yesterday released a list of 280 Canadian
projects that will receive CFI grants. To receive funding, applicants had
to demonstrate the excellence and innovative nature of their projects and
how they will benefit Canada.
"Our recent success in the Innovation competitions coupled with our 100
per cent success rate in the New Opportunities program clearly
establishes the University of Ottawa as one of Canada's leading
research-intensive universities," noted rector Gilles Patry.
The Canadian Century Research Infrastructure will receive $5.2M. With
the matching funds from each province and thanks to the contributions of
our partners, a total of $13.4M will be invested in this project.
The project leader of the CCRI is Chad Gaffield, Director of the
Institute of Canadian Studies and Professor of History at the University
of Ottawa. Headquarters for the CCRI will be located at the University
of Ottawa with partners in the following universities: Memorial
University of Newfoundland, Université Laval, Université du Québec à
Trois-Rivières, York University, University of Toronto, and University
of Victoria.
Canadian Century Research Infrastructure
One of the largest social science projects ever funded by CFI, the
Canadian Century Research Infrastructure will create a series of databases
from census records covering a century of Canadian life. The databases
will allow researchers to examine social structures and how they have
changed in detail that until now was simply not available. The CCRI will
spark bold and creative new approaches to the study of Canada in
universities across the country and around the world.
For more information, please contact the Institute of Canadian Studies
at canada@uottawa.ca or phone (613) 562-5111.
--[2]------------------------------------------------------------------
Date: Tue, 05 Feb 2002 08:10:55 +0000
From: Magali Duclaux <duclaux@elda.fr>
Subject: ELRA News
************************************************************
ELRA - European Language Resources Association
************************************************************
We are pleased to announce some new resources
available in our catalogue of language resources:
S0119 Spanish SpeechDat Database for the Mobile Telephone Network
W0032 Modern French Corpus including Anaphors Tagging
W0033 CRATER 2
A short description of these three new resources is given
below. Please visit the online catalogue to get further details:
http://www.elda.fr/catalog.html
S0119 Spanish SpeechDat Database for the Mobile Telephone Network
***********************************************************************************
The Spanish SpeechDat database for the mobile telephone network
comprises 1066 Spanish speakers (526 males, 540 females) calling
from GSM telephones and recorded over the fixed PSTN using an
ISDN-BRI interface. The database was produced by Applied Technologies
in Language and Speech S.L. (Spain). The MDB-1000 database is
partitioned into 6 CDs in ISO 9660 format. This database follows the
specifications given in the framework of the SpeechDat(II) project.
Speech samples are stored as sequences of 8-bit 8 kHz A-law.
Each prompted utterance is stored in a separate file. Each signal file
is accompanied by an ASCII SAM label file which contains the relevant
descriptive information.
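For readers who have not handled this storage format before, the following is a minimal sketch of how 8-bit A-law samples can be expanded to 16-bit linear PCM, following the standard G.711 A-law expansion. It assumes a signal file is a headerless stream of A-law bytes sampled at 8 kHz; that layout, and the function names, are assumptions for illustration and are not taken from the SpeechDat documentation.

```python
def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    byte ^= 0x55                      # undo the even-bit inversion applied by the encoder
    t = (byte & 0x0F) << 4            # 4-bit mantissa, moved into position
    seg = (byte & 0x70) >> 4          # 3-bit segment (exponent)
    if seg == 0:
        t += 8                        # first segment: half-step offset only
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)  # higher segments scale by powers of two
    # After the XOR, a set sign bit marks a positive sample.
    return t if byte & 0x80 else -t


def decode_alaw(data: bytes) -> list[int]:
    """Decode a whole buffer of A-law bytes (assumed headerless)."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample, so duration is simply length / rate."""
    return len(data) / sample_rate
```

At one byte per sample and 8000 samples per second, each second of speech occupies 8000 bytes, which gives a quick way to estimate utterance length from file size under the same headerless assumption.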
Each speaker uttered the following items:
2 isolated digits.
1 sequence of 10 isolated digits.
4 connected digits: 1 sheet number (6 digits), 1 telephone number
(9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits).
3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date
(word style), 1 relative and general date expression.
1 word spotting phrase using an application word (embedded).
6 application words.
3 spelled words: 1 spontaneous name (own forename), 1 city
name, 1 real / artificial word for coverage.
1 currency money amount.
1 natural number.
6 directory assistance names: 1 surname (set of 500), 1 city of
birth / growing up, 1 most frequent cities (set of 500), 1 most frequent
company / agency (set of 500), 1 forename surname (set of 150), 1
spontaneous forename.
2 questions including fuzzy yes / no: 1 predominantly Yes question,
1 predominantly No question.
9 phonetically rich sentences.
2 time phrases: 1 time of day (spontaneous), 1 time phrase (word
style).
4 phonetically rich words.
Call environment.
The following age distribution has been obtained: 5 speakers are below 16
years old, 543 speakers are between 16 and 30, 307 speakers are
between 31 and 45, 202 speakers are between 46 and 60, and 9 speakers are
over 60. A pronunciation lexicon with a phonemic transcription in SAMPA is
also included.
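As a quick consistency check, the age bands quoted above sum to 1066, the same total as the male/female breakdown. The short sketch below reproduces that tally and derives the percentage share of each band; the dictionary and function names are illustrative only.

```python
# Counts as quoted in the announcement.
AGE_BANDS = {
    "under 16": 5,
    "16-30": 543,
    "31-45": 307,
    "46-60": 202,
    "over 60": 9,
}
SPEAKERS_BY_SEX = {"male": 526, "female": 540}


def shares(counts: dict) -> dict:
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {band: round(100 * n / total, 1) for band, n in counts.items()}


total_by_age = sum(AGE_BANDS.values())
total_by_sex = sum(SPEAKERS_BY_SEX.values())

if __name__ == "__main__":
    print(total_by_age, total_by_sex)
    for band, pct in shares(AGE_BANDS).items():
        print(f"{band:>8}: {pct}%")
```

The shares make the skew visible: roughly half of the speakers fall in the 16 to 30 band, while the youngest and oldest bands together account for under 2%.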
W0032 Modern French Corpus including Anaphors Tagging
********************************************************************
The corpus that includes the tagging of the anaphors was created by
the CRISTAL-GRESEC (Stendhal-Grenoble 3 University, France) team
and XRCE (Xerox Research Centre Europe, France) in the framework of
the call launched by the DGLF-LF (national institution for the French
language and the languages spoken in France) for the creation of modern
French corpora.
Over 1 million words have been annotated. The corpora have been selected
so that they represent a wide sampling of the French language (scientific
and human science articles, extracts from newspapers and magazines,
legal texts, etc.) and according to the points of interest of the teams working
on the project. The processed corpora supplied by ELRA are listed below:
- Two books edited by the CNRS: La protection des oeuvres scientifiques
en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591
words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques
et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990
words).
- 204 articles extracted from CNRS Info, a magazine which contains short
popular scientific articles from the CNRS laboratories (201 280 words).
- 14 articles dealing with Hermès Human Sciences (111 886 words).
- 136 articles extracted from "Le Monde", dealing with economics (roughly
180 760 words).
- 13 booklets of the Official Journal of the European Communities
(roughly 337 000 words).
The tagged anaphoric elements are listed below:
- Person pronouns: 3rd person pronoun, anaphoric.
- Possessive determiners: 3rd person possessive determiner.
- Demonstrative pronouns: anaphoric pronouns (celui, celle, ceux,
celles-ci, celles-là).
- Indefinite pronouns: Aucun(e), chacun(e), certain(e)s, l'un(e), les un(e)s,
tout(es), etc., when they are anaphoric.
- "Pro-verbs": "le" + "faire".
- Anaphoric and cataphoric adverbs: Dessus, dedans, dessous, when
they have an anaphoric function.
- Ellipsis of head nouns: Nominal adjectives or quantifiers determiners
ellipsis.
- Textual headers like "ce dernier": Ce dernier, le premier, etc.
The annotation scheme was defined in XML format. The texts were divided
into sections, paragraphs (<p>) and sentences (<s>). The sentence
segmentation was carried out with NLP tools developed by XRCE; the
annotation itself was done manually by two qualified linguists. A large
subset of anaphoric phrases was automatically pre-annotated. The
antecedents and the tagging of the anaphoric relations were processed
manually, with editing tools (emacs, macros from Author/Editor software)
used to ease the work. 5% of the corpora were evaluated to check the
annotation reliability.
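To give a rough idea of how a corpus segmented this way might be read, here is a small sketch using Python's standard xml.etree.ElementTree. Only the <p> and <s> elements are named in the description above; the <section> wrapper, the sample sentences, and the function names are assumptions made up for illustration, and the anaphora markup itself is not modelled because its element names are not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <p> and <s> come from the announcement;
# <section> and the sentences themselves are invented for this sketch.
SAMPLE = """<section>
  <p>
    <s>Le rapport est disponible.</s>
    <s>Il contient trois parties.</s>
  </p>
  <p>
    <s>Ce dernier point reste ouvert.</s>
  </p>
</section>"""


def count_units(xml_text: str) -> tuple:
    """Return (number of paragraphs, number of sentences) in the fragment."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//p")), len(root.findall(".//s"))


def sentences_by_paragraph(xml_text: str) -> list:
    """Return the sentence texts grouped by their enclosing paragraph."""
    root = ET.fromstring(xml_text)
    return [[s.text for s in p.findall("s")] for p in root.iter("p")]


if __name__ == "__main__":
    print(count_units(SAMPLE))
    for number, sents in enumerate(sentences_by_paragraph(SAMPLE), start=1):
        print(number, sents)
```

Grouping sentences by paragraph keeps the document hierarchy intact, which matters for anaphora work because an antecedent often sits in an earlier sentence of the same paragraph.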
W0033 CRATER 2
**********************
The CRATER corpus was built upon the foundations of an earlier project,
ET10/63, which was funded in the final phase of the Eurotra programme.
The Corpus Resources and Terminology Extraction project (MLAP-93 20)
extended the bilingual annotated English-French International
Telecommunications Union corpus produced within ET10/63 to include Spanish.
The CRATER 2 corpus was produced by the Department of Linguistics & Modern
English Language, Lancaster University (United Kingdom) with funding from
ELRA. The ELRA funding in turn was provided by the European Commission
project LRsP&P (Language Resources Production & Packaging - LE4-8335).
This project has enhanced the CRATER corpus, available under the reference
ELRA-W0003 in the ELRA catalogue. CRATER 2 has significantly expanded
the English/French component of the parallel corpus, from 1,000,000
words per language to approximately 1,500,000 tokens per language.
CRATER 2 is sold with CRATER
in a single package.
=====================================
For further information, please contact:
ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
E-mail: mapelli@elda.fr
or visit our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================
This archive was generated by hypermail 2b30 : Tue Feb 05 2002 - 03:36:22 EST