The Arrau Corpus of Anaphoric Relations
Ron Artstein
ICT; co-author: Massimo Poesio, University of Essex and University of Trento
Date: Saturday, March 15, 2008
Time: 11:30 am
Location: 3712 Lee Library, Brigham Young University, Provo, Utah
Host: American Association of Corpus Linguistics
The Arrau corpus of anaphoric relations was created at the University of Essex between
2004 and 2007. It introduces an annotation scheme specifically targeted at marking two
phenomena which had been difficult to annotate: ambiguous expressions which may refer
to more than one object from previous discourse, and expressions which refer to abstract
entities such as events, actions and plans. The corpus consists of a mixture of genres: task-oriented
dialogues from the Trains-91 and Trains-93 corpora, narratives from the Gnome
corpus and English Pear Stories corpus, and newswire from the Wall Street Journal portion
of the Penn Treebank.
The corpus was created using the MMAX2 tool (Mueller and Strube 2003), which allows
marking text units at different levels. Each noun phrase is marked as either anaphoric,
discourse-new, or non-referential. Antecedents of anaphoric NPs are marked by pointers,
and anaphoric ambiguity is indicated by multiple pointers from a single anaphoric
expression (Poesio and Artstein 2005). Reference to an event, action or plan is marked by
a pointer from the referring NP to the clause that introduces the abstract entity (Artstein
and Poesio 2006).
The Arrau corpus differs from existing corpora such as MUC and ACE in that it marks all
NPs, not only those that refer to entities of interest like people and organizations. The annotation
is richer than a division of NPs into equivalence classes which refer to the same
object, but it can be converted into equivalence classes by removing ambiguous links. The
corpus has been used in the development of the anaphora resolution system at the 2007
Johns Hopkins summer workshop on natural language engineering; we plan to release it
to the public in the coming months.
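
The conversion to equivalence classes mentioned above can be sketched roughly as follows. This is a minimal illustration, not the corpus's actual format or tooling: the Markable class, its field names, and the equivalence_classes function are hypothetical, and the sketch assumes that an "ambiguous link" means any anaphoric NP carrying more than one pointer. Pointers to clauses (abstract entities) are simply skipped here, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Markable:
    """One noun phrase. All field names are hypothetical, for illustration only."""
    id: str
    text: str
    category: str  # "anaphoric", "discourse-new", or "non-referential"
    antecedents: List[str] = field(default_factory=list)  # ids this NP points to


def equivalence_classes(markables):
    """Group referring NPs into classes, dropping ambiguous links."""
    # Non-referential NPs take part in no class at all.
    parent = {m.id: m.id for m in markables if m.category != "non-referential"}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for m in markables:
        # Assumption: a link is unambiguous when there is exactly one pointer.
        if m.category == "anaphoric" and len(m.antecedents) == 1:
            target = m.antecedents[0]
            if target in parent:  # skip pointers to clauses or unknown ids
                parent[find(m.id)] = find(target)

    classes = {}
    for mid in parent:
        classes.setdefault(find(mid), set()).add(mid)
    return sorted(sorted(c) for c in classes.values())


nps = [
    Markable("np1", "the engine", "discourse-new"),
    Markable("np2", "the boxcar", "discourse-new"),
    Markable("np3", "it", "anaphoric", ["np1"]),
    Markable("np4", "it", "anaphoric", ["np1", "np2"]),  # ambiguous: two pointers
    Markable("np5", "it", "non-referential"),
]
print(equivalence_classes(nps))
```

In this sketch the ambiguous NP ends up in a class of its own once its links are removed; a real conversion might instead choose one antecedent or discard the NP, depending on the intended use.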