Ron Artstein: “The Arrau Corpus of Anaphoric Relations”

March 15, 2008 | Brigham Young University, Provo, UT

Speaker: Ron Artstein
Host: American Association of Corpus Linguistics

The Arrau corpus of anaphoric relations was created at the University of Essex between 2004 and 2007. It introduces an annotation scheme specifically targeted at marking two phenomena that have been difficult to annotate: ambiguous expressions, which may refer to more than one object from previous discourse, and expressions that refer to abstract entities such as events, actions and plans. The corpus consists of a mixture of genres: task-oriented dialogues from the Trains-91 and Trains-93 corpora, narratives from the Gnome corpus and English Pear Stories corpus, and newswire from the Wall Street Journal portion of the Penn Treebank.

The corpus was created using the MMAX2 tool (Mueller and Strube 2003), which allows marking text units at different levels. Each noun phrase is marked as either anaphoric, discourse-new, or non-referential. Antecedents of anaphoric NPs are marked by pointers, and anaphoric ambiguity is indicated by multiple pointers from a single anaphoric expression (Poesio and Artstein 2005). Reference to an event, action or plan is marked by a pointer from the referring NP to the clause that introduces the abstract entity (Artstein and Poesio 2006).
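As a rough illustration of the scheme just described, the following Python sketch models markables with a three-way status and antecedent pointers. It is not the MMAX2 file format or the corpus's actual data model: the names `Status`, `Markable`, `antecedents` and `is_ambiguous`, as well as the example phrases, are invented here for illustration only.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    """The three NP categories named in the abstract."""
    ANAPHORIC = "anaphoric"
    DISCOURSE_NEW = "discourse-new"
    NON_REFERENTIAL = "non-referential"


@dataclass
class Markable:
    """A marked text unit: a noun phrase, or a clause serving as an antecedent.

    Hypothetical structure; field names are assumptions, not MMAX2 attributes.
    """
    ident: str
    text: str
    status: Optional[Status] = None  # None for clauses, which carry no NP status
    antecedents: List[str] = field(default_factory=list)  # ids pointed to

    def is_ambiguous(self) -> bool:
        # Ambiguity is indicated by more than one pointer from one expression.
        return len(self.antecedents) > 1


# Invented example, not taken from the corpus.
markables: Dict[str, Markable] = {
    m.ident: m
    for m in [
        Markable("np1", "engine E1", Status.DISCOURSE_NEW),
        Markable("np2", "the boxcar", Status.DISCOURSE_NEW),
        # An ambiguous pronoun: two pointers from a single anaphoric NP.
        Markable("np3", "it", Status.ANAPHORIC, ["np1", "np2"]),
        # A clause that introduces an abstract entity (an action).
        Markable("cl1", "hook up engine E1 to the boxcar"),
        # Reference to the action: pointer from the NP to the clause.
        Markable("np4", "that", Status.ANAPHORIC, ["cl1"]),
    ]
}
```

In this sketch, ambiguity and reference to abstract entities need no special machinery: both are ordinary pointers, distinguished only by how many there are and by what kind of unit they target.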

The Arrau corpus differs from existing corpora like MUC and ACE in that it marks all NPs, not only those that refer to entities of interest like people and organizations. The annotation is richer than a division of NPs into equivalence classes that refer to the same object, but it can be converted into equivalence classes by removing ambiguous links. The corpus has been used in the development of the anaphora resolution system at the 2007 Johns Hopkins summer workshop on natural language engineering; we plan to release it to the public in the coming months.
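The conversion to equivalence classes mentioned above could be sketched as follows. This is one possible reading, not the procedure actually used for the corpus: it assumes that "removing ambiguous links" means dropping every pointer from an NP that has more than one antecedent, and it groups the remaining NPs with a simple union-find. The function name `equivalence_classes` and the NP identifiers are hypothetical.

```python
from typing import Dict, List, Set


def equivalence_classes(links: Dict[str, List[str]]) -> List[Set[str]]:
    """Group NPs into coreference classes, ignoring ambiguous links.

    `links` maps each NP id to the ids of its antecedents (empty for
    discourse-new NPs). Assumption: an NP with several antecedents has
    all of its links dropped and ends up in a class of its own.
    """
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for np_id, antecedents in links.items():
        find(np_id)  # register every NP, even unlinked ones
        if len(antecedents) == 1:
            union(np_id, antecedents[0])
        # len(antecedents) > 1: ambiguous, so the links are discarded

    classes: Dict[str, Set[str]] = {}
    for np_id in list(parent):
        classes.setdefault(find(np_id), set()).add(np_id)
    # Sort for a deterministic result.
    return sorted(classes.values(), key=lambda members: sorted(members))


# Invented example: np4 is ambiguous between np1 and np2.
example_links = {
    "np1": [],
    "np2": [],
    "np3": ["np1"],
    "np4": ["np1", "np2"],
    "np5": ["np3"],
}
example_classes = equivalence_classes(example_links)
```

Here `example_classes` contains three classes: `np1`, `np3` and `np5` together, and `np2` and `np4` each alone. The information that `np4` might corefer with either `np1` or `np2` is lost, which is the sense in which the original annotation is richer than the equivalence classes.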