GUMBridge: a Corpus for Varieties of Bridging Anaphora
Lauren Levine, Amir Zeldes
TL;DR
GUMBridge addresses the fragmented landscape of bridging anaphora resources by introducing a large, genre-diverse corpus with fine-grained, multi-subtype annotations for bridging instances. Built on GUM v11, it contains 2,268 bridging instances over 127k tokens across 16 genres and supports 11 subtypes across three main categories, enabling multi-label annotation per instance. The authors demonstrate annotation reliability through an inter-annotator study and benchmark contemporary LLMs on bridging resolution and subtype classification, finding that these tasks remain challenging. Results indicate potential for LLM-assisted improvement but underscore the need for more robust modeling and cross-linguistic extension. Overall, GUMBridge provides a rich resource for linguistics and NLP tasks requiring nuanced bridging analysis and cross-genre evaluation.
Abstract
Bridging is an anaphoric phenomenon in which the interpretation of an entity in a discourse depends on a previous, non-identical entity, as in "There is 'a house'. 'The door' is red," where the door is specifically understood to be the door of the aforementioned house. While several resources for bridging anaphora exist in English, most are small, provide limited coverage of the phenomenon, and/or provide limited genre coverage. In this paper, we introduce GUMBridge, a new resource for bridging, which includes 16 diverse genres of English, providing both broad coverage of the phenomenon and granular annotations for the subtype categorization of bridging varieties. We also present an evaluation of annotation quality and report baseline performance of open- and closed-source contemporary LLMs on three tasks underlying our data, showing that bridging resolution and subtype classification remain difficult NLP tasks in the age of LLMs.
