Exploring State Tracking Capabilities of Large Language Models
Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar
TL;DR
This work investigates whether Transformer-based LLMs can maintain a coherent internal state while applying sequential updates. It introduces a simple benchmark of three tasks (LinearWorld, HandSwap, and Lights) that isolates state-tracking behavior, and evaluates multiple models with and without Chain-of-Thought (CoT) prompting at depths up to $d=10$. Key contributions include a dataset construction method for elementary state-tracking tasks, a broad cross-model comparison, and an analysis of factors influencing success (depth, prompting, and use of the input tape as memory). Findings show that GPT-4o and Llama3-70B maintain state effectively when using CoT, while smaller or older models often fail at higher depths; CoT generally improves performance, consistent with models using their context as temporary memory. The benchmark offers practical guidance on prompting strategies and deployment decisions for real-world applications that require state tracking from Transformers trained on natural language.
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that recent-generation LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when combined with mechanisms such as Chain of Thought. However, earlier-generation models, while understanding the task and being able to solve it in the initial steps, often fail after a certain number of steps.
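To make the notion of state tracking concrete, the following is a minimal sketch of a swap-style episode, loosely inspired by the HandSwap task name. It is not the paper's dataset construction: the function names, the prompt wording, and the reading of depth as the number of sequential updates are all assumptions made for illustration.

```python
import random


def make_swap_episode(names, items, depth, seed=0):
    """Build a toy episode: initial state, `depth` swap updates, final state.

    Hypothetical helper; the benchmark's real generator may differ.
    """
    rng = random.Random(seed)
    state = dict(zip(names, items))  # who holds what at the start
    initial = dict(state)
    updates = []
    for _ in range(depth):
        a, b = rng.sample(names, 2)  # two distinct entities
        state[a], state[b] = state[b], state[a]  # apply one update
        updates.append((a, b))
    return initial, updates, state


def apply_updates(initial, updates):
    """Ground-truth tracker: replay every swap in order."""
    state = dict(initial)
    for a, b in updates:
        state[a], state[b] = state[b], state[a]
    return state


def render_prompt(initial, updates, query):
    """Turn an episode into a plain-text question (assumed wording)."""
    lines = [f"{name} holds the {item}." for name, item in initial.items()]
    lines += [f"{a} and {b} swap items." for a, b in updates]
    lines.append(f"What does {query} hold now?")
    return "\n".join(lines)


if __name__ == "__main__":
    initial, updates, final = make_swap_episode(
        ["Ann", "Bob", "Cy"], ["key", "coin", "ring"], depth=5
    )
    print(render_prompt(initial, updates, "Ann"))
    print("Expected answer:", final["Ann"])
```

A model answers correctly only if it carries the full assignment through every update, so accuracy as a function of `depth` gives a direct view of where tracking breaks down. With Chain-of-Thought, the model can write out the intermediate assignment after each swap, which corresponds to calling `apply_updates` on a growing prefix of the updates.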
