Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp
TL;DR
The paper addresses the challenge of lexical variation in cross-document coreference resolution (CDCR) for news by introducing a lexically rich annotation scheme that treats coreference chains as discourse elements (DEs) and supports both identity and near-identity relations. It unifies the NewsWCL50 and ECB+ annotation approaches into a single codebook, reannotates NewsWCL50 and a subset of ECB+ (yielding NewsWCL50r and ECB+r), and evaluates them with lexical diversity metrics and a same-head-lemma baseline, finding that the reannotated corpora exhibit balanced diversity and moderate difficulty. Key results include substantial increases in DE granularity for NewsWCL50r and broader DEs for ECB+r, along with MTLD increases and comparable CoNLL-F1 scores (54.08 vs 52.92). The work advances discourse-aware CDCR and enables large-scale analyses of media bias, framing, and discourse in polarized news coverage by capturing paraphrases, euphemisms, and other non-identical but contextually equivalent references.
Abstract
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme for the NewsWCL50 dataset that treats coreference chains as discourse elements (DEs) and as conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", which allows models to capture lexical diversity and framing variation in media discourse while maintaining the fine-grained annotation of DEs. We reannotate NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
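To make the same-head-lemma baseline mentioned above concrete, the following is a minimal sketch of the general idea, not the authors' implementation: mentions from all documents are placed in one cluster whenever the lemmas of their syntactic heads match. The `Mention` class, the `same_head_lemma_clusters` function, the document IDs, and the precomputed `head_lemma` values are assumptions made for illustration; in practice the head lemma would come from a parser and lemmatizer. The mention "a migrant caravan" is a hypothetical addition, while the other three phrases are taken from the abstract's example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    doc_id: str      # hypothetical document identifier
    text: str        # surface form of the mention
    head_lemma: str  # assumed to be precomputed by a parser/lemmatizer


def same_head_lemma_clusters(mentions):
    """Group mentions across documents by the lowercased lemma of their head."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention.head_lemma.lower()].append(mention)
    return list(clusters.values())


mentions = [
    Mention("doc1", "the caravan", "caravan"),
    Mention("doc2", "a migrant caravan", "caravan"),  # hypothetical mention
    Mention("doc2", "asylum seekers", "seeker"),
    Mention("doc3", "those contemplating illegal entry", "those"),
]

clusters = same_head_lemma_clusters(mentions)
for cluster in clusters:
    print([m.text for m in cluster])
```

In this sketch only the two "caravan" mentions end up together, while "asylum seekers" and "those contemplating illegal entry" stay in separate clusters even though the annotation scheme would link all of them. This is why such a baseline is a useful proxy for difficulty: the more lexically diverse the coreference chains are, the less a head-lemma match can recover.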
