How to Evaluate Coreference in Literary Texts?
Ana-Isabel Duron-Tejedor, Pascal Amsili, Thierry Poibeau
TL;DR
The paper addresses the evaluation of coreference in literary texts, arguing that standard NLP metrics are ill-suited to long narratives. It surveys the traditional link-based and mention-based metrics (MUC, BLANC, B^3, CEAF, LEA) and highlights issues such as the mention-identification effect and a lack of interpretability. It shows that OntoNotes, a common evaluation corpus, is poorly matched to fiction: its texts are short and its referential forms restricted, so its chain-length patterns diverge from those found in novels. The authors propose a context-aware, content-driven evaluation framework that treats long chains, short chains, and singletons separately to yield interpretable diagnostics, and they plan to validate it across languages and multiple novels.
Abstract
In this short paper, we examine the main metrics used to evaluate textual coreference and detail some of their limitations. We show that a single score cannot represent the full complexity of the problem at stake, and is thus uninformative or even misleading. We propose a new way of evaluating coreference that takes the context into account (in our case, the analysis of fiction, especially novels). More specifically, we propose to distinguish long coreference chains (corresponding to main characters) from short ones (corresponding to secondary characters) and from singletons (isolated elements). In this way, we hope to obtain more interpretable, and thus more informative, evaluation results.
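The three-way distinction proposed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the representation of a mention as a (start, end) span, the `long_threshold` cut-off, and the use of plain mention recall as the per-category diagnostic are all assumptions made for illustration.

```python
def bucket_chains(chains, long_threshold=10):
    """Split coreference chains into singletons, short chains and long chains.

    `chains` is a list of chains; each chain is a list of mentions.
    `long_threshold` is a hypothetical cut-off, not a value from the paper.
    """
    buckets = {"singleton": [], "short": [], "long": []}
    for chain in chains:
        if len(chain) == 1:
            buckets["singleton"].append(chain)
        elif len(chain) < long_threshold:
            buckets["short"].append(chain)
        else:
            buckets["long"].append(chain)
    return buckets


def mention_recall_by_bucket(gold_chains, predicted_chains, long_threshold=10):
    """Report, per gold bucket, the share of gold mentions found by the system.

    This is a deliberately simple diagnostic (exact-span mention recall),
    chosen only to show how one score can be replaced by one score per category.
    """
    predicted_mentions = {m for chain in predicted_chains for m in chain}
    recall = {}
    for name, chains in bucket_chains(gold_chains, long_threshold).items():
        mentions = [m for chain in chains for m in chain]
        if mentions:
            found = sum(m in predicted_mentions for m in mentions)
            recall[name] = found / len(mentions)
        else:
            recall[name] = None  # no gold mentions in this category
    return recall


# Toy data: mentions are (start, end) token spans.
gold = [[(0, 1), (5, 6), (9, 10)], [(2, 3), (7, 8)], [(12, 13)]]
predicted = [[(0, 1), (5, 6)], [(2, 3), (7, 8)]]
scores = mention_recall_by_bucket(gold, predicted, long_threshold=3)
print(scores)
```

On this toy input, the long chain has two of its three mentions recovered, the short chain is fully recovered, and the singleton is missed entirely, a contrast that a single aggregate score would hide.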
