Entity Tracking in Language Models

Najoung Kim, Sebastian Schuster

TL;DR

This paper investigates whether large language models can track the states of discourse entities as a text unfolds. It introduces a task in which a model must infer the final state of an entity from an English description of the initial state and a sequence of state-changing operations, and it evaluates both in-context behavior and finetuned learning. The authors find that, among Flan-T5, GPT-3 and GPT-3.5, only the GPT-3.5 models, which were pretrained on large amounts of code, exhibit non-trivial entity tracking. They further show that a smaller model, T5, can learn this ability through finetuning; performance degrades on more complex splits but remains non-trivial on entities unseen in training and on longer operation sequences. Overall, the work suggests that the composition of pretraining data matters for whether entity tracking surfaces, and it provides an evaluation framework for studying discourse-level reasoning in language models.

Abstract

Keeping track of how states of entities change as a text or dialog unfolds is a key prerequisite to discourse understanding. Yet, there have been few systematic investigations into the ability of large language models (LLMs) to track discourse entities. In this work, we present a task probing to what extent a language model can infer the final state of an entity given an English description of the initial state and a series of state-changing operations. We use this task to first investigate whether Flan-T5, GPT-3 and GPT-3.5 can track the state of entities, and find that only GPT-3.5 models, which have been pretrained on large amounts of code, exhibit this ability. We then investigate whether smaller models pretrained primarily on text can learn to track entities, through finetuning T5 on several training/evaluation splits. While performance degrades for more complex splits, we find that even when evaluated on a different set of entities from training or longer operation sequences, a finetuned model can perform non-trivial entity tracking. Taken together, these results suggest that language models can learn to track entities but pretraining on text corpora alone does not make this capacity surface.
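To make the task format concrete, the following is a minimal sketch of the kind of ground-truth computation the task implies: start from an initial state, apply state-changing operations in order, and read off the final state. The operation names (`Put`, `Remove`, `Move`), the box numbering, the item names, and the wording produced by `describe` are all illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass


# Hypothetical operation types. The abstract only says "state-changing
# operations"; these three are assumptions chosen for illustration.
@dataclass(frozen=True)
class Put:
    item: str
    box: int


@dataclass(frozen=True)
class Remove:
    item: str
    box: int


@dataclass(frozen=True)
class Move:
    item: str
    src: int
    dst: int


def apply_operations(initial, operations):
    """Return the final box contents after applying each operation in order."""
    # Copy the sets so the caller's initial state is left untouched.
    state = {box: set(items) for box, items in initial.items()}
    for op in operations:
        if isinstance(op, Put):
            state[op.box].add(op.item)
        elif isinstance(op, Remove):
            state[op.box].remove(op.item)
        elif isinstance(op, Move):
            state[op.src].remove(op.item)
            state[op.dst].add(op.item)
        else:
            raise TypeError(f"unknown operation: {op!r}")
    return state


def describe(state):
    """Render a state as English sentences (the wording is an assumption)."""
    sentences = []
    for box in sorted(state):
        items = sorted(state[box])
        contents = " and ".join(f"the {item}" for item in items) if items else "nothing"
        sentences.append(f"Box {box} contains {contents}.")
    return " ".join(sentences)


# Illustrative example: three boxes, three operations.
initial = {0: {"apple"}, 1: {"key", "map"}, 2: set()}
ops = [Move("key", 1, 2), Put("coin", 0), Remove("apple", 0)]
final = apply_operations(initial, ops)
print(describe(final))
```

A model under evaluation would see only the textual description of the initial state and the operations, and would be asked to state the contents of a box afterwards. A simulator like this one would supply the reference answer against which the model's prediction is scored.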

Paper Structure (34 sections, 9 figures, 7 tables)

Figures (9)

  • Figure 1: A sketch of our entity tracking task.
  • Figure 2: Accuracy on state prediction after $n$ operations that affect a specific box. Left: predictions for boxes whose content differs from the initial state. Right: predictions for boxes whose content is the same as in the initial state. Error bars show 95% CIs.
  • Figure 3: Entity tracking accuracy of text-davinci-003 with low lexical overlap between demonstration and test examples (AltForms).
  • Figure 4: Entity tracking accuracy of text-davinci-003 for the AmbiRef (left) and MoveContents (right) datasets.
  • Figure 5: Accuracy on state prediction for different GPT-3 models. Solid lines denote models trained on code and text, and dotted lines denote models mainly trained on text.
  • ...and 4 more figures