Exploring State Tracking Capabilities of Large Language Models

Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar

TL;DR

This work investigates whether Transformer-based LLMs can maintain a coherent internal state while applying sequential updates. It introduces a simple three-task benchmark (LinearWorld, HandSwap, and Lights) to isolate state-tracking behavior, and evaluates multiple models with and without Chain-of-Thought (CoT) prompting at depths up to $d=10$. Key contributions include a dataset construction method for elementary state-tracking tasks, a broad cross-model comparison of state-tracking performance, and an analysis of the factors that influence success (depth, prompting, and memory via the input tape). Findings show that GPT-4o and Llama3-70B maintain state effectively when using CoT, while smaller or older models often fail at higher depths; CoT generally improves performance and lets models use the context as temporary memory. The benchmark offers practical guidance on prompting strategies and deployment decisions for real-world applications that require state tracking from Transformers trained on natural language.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
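To make the notion of state tracking at a given depth concrete, here is a minimal sketch of a swap-style episode, loosely in the spirit of the HandSwap task. It is an illustration under stated assumptions, not the paper's dataset construction: the function names (`make_swap_episode`, `apply_updates`, `to_prompt`), the entity and item names, and the prompt wording are all hypothetical.

```python
import random


def make_swap_episode(names, items, depth, seed=0):
    """Build a toy episode: initial state, a list of swap updates, final state.

    Hypothetical helper; the paper's actual generation procedure may differ.
    """
    rng = random.Random(seed)
    state = dict(zip(names, items))  # who holds which item at the start
    initial = dict(state)
    updates = []
    for _ in range(depth):
        a, b = rng.sample(names, 2)  # pick two distinct entities
        state[a], state[b] = state[b], state[a]  # they exchange items
        updates.append((a, b))
    return initial, updates, state


def apply_updates(initial, updates):
    """Reference tracker: replay the swaps to obtain the ground-truth state."""
    state = dict(initial)
    for a, b in updates:
        state[a], state[b] = state[b], state[a]
    return state


def to_prompt(initial, updates, query):
    """Render the episode as plain text (wording is an assumption)."""
    lines = [f"{name} holds the {item}." for name, item in initial.items()]
    lines += [f"{a} and {b} swap items." for a, b in updates]
    lines.append(f"What does {query} hold now?")
    return "\n".join(lines)


if __name__ == "__main__":
    names = ["Ann", "Bob", "Cy"]
    items = ["ball", "key", "pen"]
    initial, updates, final = make_swap_episode(names, items, depth=10)
    print(to_prompt(initial, updates, query="Ann"))
    print("Gold answer:", final["Ann"])
```

In this sketch, depth is simply the number of updates applied before the query; a model's answer would be scored against the state produced by `apply_updates`.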

Paper Structure

This paper contains 34 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: A simple illustration of the initial state of each task (top) and the updated state after a single update step (bottom). Note that the configuration shown for Lights is incomplete, as only two switches are depicted in the image; for the exact configuration of rooms and switches, please refer to the task configuration section (\ref{sec:task_configuration}).
  • Figure 2: Average accuracy at different depths across tasks (left: LinearWorld, middle: Hands, and right: Lights) for all systems except for the two top performers which use Chain of Thought (CoT), i.e., Llama3 70B and GPT-4.
  • Figure 3: Accuracy at different depths comparing the "swap" update type with the "integer" update type in the LinearWorld task (the state-dependent query variants) for all CoT-integrated systems.
  • Figure 4: Average number of mathematical expressions per model response for all the models (left), and accuracy of generated expression evaluations across models integrated with CoT (right).
  • Figure 5: Accuracy at different depths comparing the "state-dependent" query type with the "random" query type in the LinearWorld task (the variants with 5 individuals) across all the systems.