LIEDER: Linguistically-Informed Evaluation for Discourse Entity Recognition
Xiaomeng Zhu, Robert Frank
TL;DR
LIEDER introduces a linguistically informed evaluation framework for Discourse Entity (DE) recognition. It targets four semantic properties (existence, uniqueness, plurality, and novelty) to diagnose how well large language models capture the introduction of DEs and subsequent reference to them. Adapting and extending the Schuster-Linzen paradigm, LIEDER pairs a context of two conjoined clauses with a definite continuation across eight context types, yielding 128 items, and measures models' sensitivity to felicity through probabilistic comparisons rather than direct judgments. Across open- and closed-source LLMs, the study finds strong knowledge of existence, uniqueness, and plurality but a notable deficit in novelty. A second experiment shows that performance on novelty improves significantly when explicit cues distinguish the DEs, underscoring the role of linguistic signals in DE introduction and highlighting a distance effect in which DEs introduced earlier are harder to refer to. Overall, LIEDER provides a fine-grained benchmark that reveals current limits of state-of-the-art LLMs in human-like discourse entity handling and can guide future linguistically informed evaluation efforts.
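The idea of scoring felicity by probabilistic comparison, rather than by asking a model for a judgment, can be illustrated with a minimal sketch. This is not the authors' implementation, and the exact metric used in the paper is not specified here. The sketch assumes that each item pairs one definite continuation with a felicitous and an infelicitous context, and that an item counts as correct when the model assigns the continuation a higher log-probability after the felicitous context. The names `Item`, `prefers_felicitous`, `accuracy`, and `table_log_prob` are hypothetical, the example sentences are illustrative and not drawn from the dataset, and the log-probabilities are placeholders standing in for a real language model.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Item(NamedTuple):
    """One hypothetical evaluation item: two contexts, one shared continuation."""
    felicitous_context: str
    infelicitous_context: str
    continuation: str


# A scorer maps (context, continuation) to log P(continuation | context).
Scorer = Callable[[str, str], float]


def prefers_felicitous(item: Item, log_prob: Scorer) -> bool:
    """True if the continuation is more probable after the felicitous context."""
    good = log_prob(item.felicitous_context, item.continuation)
    bad = log_prob(item.infelicitous_context, item.continuation)
    return good > bad


def accuracy(items: List[Item], log_prob: Scorer) -> float:
    """Fraction of items on which the scorer prefers the felicitous context."""
    if not items:
        raise ValueError("no items to score")
    return sum(prefers_felicitous(i, log_prob) for i in items) / len(items)


# Illustrative items (assumption: these sentences are NOT taken from LIEDER).
ITEMS = [
    Item(
        felicitous_context="John owns a dog and Mark owns a cat.",
        infelicitous_context="John doesn't own a dog and Mark owns a cat.",
        continuation="The dog is very friendly.",
    ),
    Item(
        felicitous_context="John owns a dog and Mark owns a cat.",
        infelicitous_context="John owns a dog and Mark owns a dog.",
        continuation="The dog is very friendly.",
    ),
]

# Placeholder log-probabilities, NOT real model outputs. In practice these
# would come from a language model scoring the continuation tokens.
PLACEHOLDER_SCORES: Dict[Tuple[str, str], float] = {
    (ITEMS[0].felicitous_context, ITEMS[0].continuation): -12.0,
    (ITEMS[0].infelicitous_context, ITEMS[0].continuation): -15.5,
    (ITEMS[1].infelicitous_context, ITEMS[1].continuation): -11.0,
}


def table_log_prob(context: str, continuation: str) -> float:
    """Stand-in scorer that looks up the placeholder log-probabilities."""
    return PLACEHOLDER_SCORES[(context, continuation)]
```

Calling `accuracy(ITEMS, table_log_prob)` gives the share of items on which the stand-in scorer prefers the felicitous context. Swapping `table_log_prob` for a function that sums a real model's token log-probabilities over the continuation would leave the comparison logic unchanged.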
Abstract
Discourse Entity (DE) recognition is the task of identifying novel and known entities introduced within a text. While previous work has found that large language models have basic, if imperfect, DE recognition abilities (Schuster and Linzen, 2022), it remains largely unassessed which of the fundamental semantic properties governing the introduction of, and subsequent reference to, DEs these models have knowledge of. We propose the Linguistically-Informed Evaluation for Discourse Entity Recognition (LIEDER) dataset, which allows for a detailed examination of language models' knowledge of four crucial semantic properties: existence, uniqueness, plurality, and novelty. We find evidence that state-of-the-art large language models exhibit sensitivity to all of these properties except novelty, which demonstrates that they have yet to reach human-level language understanding abilities.
