LIEDER: Linguistically-Informed Evaluation for Discourse Entity Recognition

Xiaomeng Zhu, Robert Frank

TL;DR

LIEDER introduces a linguistically informed evaluation framework for Discourse Entity (DE) recognition. It targets four semantic properties (existence, uniqueness, plurality, and novelty) to diagnose how well large language models capture the introduction of DEs and subsequent reference to them. Adapting and extending the paradigm of Schuster and Linzen (2022), LIEDER pairs a context of two conjoined clauses with a definite continuation, across eight context types, to yield 128 items. Felicity is measured by comparing model-assigned probabilities rather than by eliciting direct judgments. Across open- and closed-source LLMs, the study finds strong knowledge of existence, uniqueness, and plurality but a notable deficit in novelty. A second experiment shows that adding explicit cues that distinguish distinct DEs significantly improves performance on novelty, underscoring the role of linguistic signals in DE introduction and highlighting a distance effect in which DEs introduced earlier are harder to refer to. Overall, LIEDER provides a fine-grained benchmark that reveals current limits of state-of-the-art LLMs in human-like discourse entity handling and can guide future linguistically informed evaluation efforts.
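The probabilistic comparison mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that an item counts as correct when the same definite continuation receives a higher log-probability after a felicitous context than after an infelicitous one. The names `comparison_accuracy` and `toy_log_prob` and the example sentences are hypothetical. In practice the scorer would be a language model's conditional log-probability.

```python
from typing import Callable, Iterable, Tuple

# One item: (felicitous context, infelicitous context, shared continuation).
Item = Tuple[str, str, str]
# A scorer returns log P(continuation | context); hypothetical interface.
Scorer = Callable[[str, str], float]


def comparison_accuracy(items: Iterable[Item], log_prob: Scorer) -> float:
    """Fraction of items where the continuation is more probable
    after the felicitous context than after the infelicitous one."""
    items = list(items)
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        log_prob(good, cont) > log_prob(bad, cont)
        for good, bad, cont in items
    )
    return correct / len(items)


def toy_log_prob(context: str, continuation: str) -> float:
    """Stand-in scorer, NOT a language model: it simply assigns a lower
    score whenever the context contains a negation."""
    return -5.0 if "doesn't" in context else -1.0


# Illustrative item only; not taken from the LIEDER dataset.
items = [
    (
        "John owns a dog and Mark owns a cat.",
        "John doesn't own a dog and Mark owns a cat.",
        "The dog is very cute.",
    )
]

accuracy = comparison_accuracy(items, toy_log_prob)
print(accuracy)
```

Passing the scorer as a function keeps the comparison logic separate from any particular model, so the same loop could be reused across open- and closed-source models, provided each exposes conditional log-probabilities.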

Abstract

Discourse Entity (DE) recognition is the task of identifying novel and known entities introduced within a text. While previous work has found that large language models have basic, if imperfect, DE recognition abilities (Schuster and Linzen, 2022), it remains largely unassessed which of the fundamental semantic properties that govern the introduction and subsequent reference to DEs they have knowledge of. We propose the Linguistically-Informed Evaluation for Discourse Entity Recognition (LIEDER) dataset that allows for a detailed examination of language models' knowledge of four crucial semantic properties: existence, uniqueness, plurality, and novelty. We find evidence that state-of-the-art large language models exhibit sensitivity to all of these properties except novelty, which demonstrates that they have yet to reach human-level language understanding abilities.

Paper Structure (34 sections, 4 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Results for singular continuations by model and comparison type. The dotted lines indicate chance performance and the error bars indicate bootstrapped 95% confidence intervals.
  • Figure 2: Preference for neg_pos over pos_neg by model.
  • Figure 3: Decomposition of results for affirmative-negation type sentences in Schuster and Linzen (2022) by distance. Data for GPT-2, GPT-2 M, GPT-2 L, GPT-2 XL, and GPT-3 are retrieved from their GitHub repository.
  • Figure 4: Results for plural comparisons by model and comparison type.
  • Figure 5: Results by model and comparison type for comparisons across singular and plural continuations.
  • ...and 11 more figures
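
The bootstrapped 95% confidence intervals mentioned in the Figure 1 caption could be computed along the following lines. This is a sketch under stated assumptions: the caption does not say which bootstrap variant was used, so a percentile bootstrap over per-item correctness (1 = correct, 0 = incorrect) is assumed here. The function name `bootstrap_ci`, its parameters, and the example outcomes are hypothetical, not reported results.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for mean accuracy.

    `outcomes` holds one 0/1 value per item. Items are resampled with
    replacement and the interval is read off the sorted resample means.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Made-up outcomes for illustration: 96 correct out of 128 items.
example_outcomes = [1] * 96 + [0] * 32
low, high = bootstrap_ci(example_outcomes)
print(low, high)
```

An interval that excludes 0.5 would indicate performance distinguishable from the chance level marked by the dotted lines, under the assumptions above.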