State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

Annie Wong, Aske Plaat, Thomas Bäck, Niki van Stein, Anna V. Kononova

TL;DR

State Design Matters investigates how inference-time state representations shape dynamic reasoning in large language models. By manipulating three axes (granularity, structure, and spatial grounding) and evaluating across multiple open-source models and dynamic benchmarks, the study shows that trajectory summarisation often aids long-horizon decisions, that natural language representations are broadly robust, and that Visualization-of-Thought (VoT) can improve spatial reasoning when models can construct reliable maps. The findings emphasise that how a state is represented, not just what information it conveys, drives performance, and they reveal the brittleness of current LLMs and VLMs over long horizons. The work also identifies conditions under which structured or spatial representations help, and highlights directions for improving reliability, such as verification and improved spatial state tracking. Overall, the paper argues for a shift toward dynamic evaluation to capture agentic competence in evolving environments.

Abstract

As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. First, we find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.

Paper Structure (30 sections, 3 equations, 2 figures, 7 tables)

Figures (2)

  • Figure 1: Normalised score per input token for Long Form vs. Summary prompting. Bars report normalised score per input token (×1000), computed as the episode score (normalised to 0–1) divided by the average number of input prompt tokens. Higher values indicate better performance per token spent. The Long Form condition provides the acting agent with the full interaction history at each step, whereas Summary replaces this history with a rolling summary intended to preserve task-relevant state while reducing context length. Empty bars indicate settings where the agent achieves zero score. Token counts include only the input prompt shown to the acting agent. The Summary condition additionally requires a separate summarisation call; those tokens are excluded to isolate the effect of input context length on decision quality (i.e., we are not measuring end-to-end efficiency here).
  • Figure 2: Effect of spatial grounding on performance relative to the text-only baseline. We report normalised performance differences from adding spatial grounding via images (Vision) or text-based spatial maps (VoT); positive values indicate improvement and negative values indicate degradation, while near-zero values indicate no measurable change (neutral). A superscript $^{*}$ marks a significant difference relative to the baseline for the same model and task, using a bootstrap test of the mean difference (10,000 resamples; 95% CI). Overall, VoT improves over baseline in 15/24 instances, degrades in 5/24, and is neutral in 4/24; Vision improves in 10/24, degrades in 6/24, and is neutral in 8/24. Significant improvements come from VoT in Hanoi for LLaVA-7B, in BabyAI Open for Qwen3-VL-32B, and in BabyAI Pickup for Qwen2.5VL-7B and Qwen3-VL-32B; VoT also shows a significant degradation in Messenger for LLaVA-7B. For Vision, significant effects are mostly degradations in Hanoi for LLaVA-Phi3-3.8B and Qwen3-VL-32B, with only a marginally significant improvement in Messenger for LLaVA-7B.
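
The two quantities described in the captions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the bootstrap is assumed to be a percentile bootstrap that resamples the two conditions independently (the captions do not say whether episodes are paired), and a difference is treated as significant when the 95% interval excludes zero.

```python
import random


def score_per_1k_tokens(normalised_score, avg_input_tokens):
    """Figure 1 metric: episode score (0-1) per input token, scaled by 1000."""
    if avg_input_tokens <= 0:
        raise ValueError("avg_input_tokens must be positive")
    return 1000.0 * normalised_score / avg_input_tokens


def bootstrap_mean_diff_ci(treatment, baseline, n_resamples=10_000,
                           alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(baseline).

    Assumption: each condition is resampled independently (unpaired),
    with replacement, keeping the original sample sizes.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        t = rng.choices(treatment, k=len(treatment))
        b = rng.choices(baseline, k=len(baseline))
        diffs.append(sum(t) / len(t) - sum(b) / len(b))
    diffs.sort()
    # Cut alpha/2 of the resampled differences from each tail.
    cut = round(n_resamples * alpha / 2)
    lower = diffs[min(cut, n_resamples - 1)]
    upper = diffs[max(n_resamples - cut - 1, 0)]
    return lower, upper


def is_significant(ci):
    """A difference counts as significant if the interval excludes zero."""
    lower, upper = ci
    return lower > 0 or upper < 0


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    print(score_per_1k_tokens(0.5, 2000))  # 0.25
    vot_scores = [0.9, 0.8, 0.85, 0.95, 0.9]
    text_only_scores = [0.1, 0.2, 0.15, 0.1, 0.2]
    ci = bootstrap_mean_diff_ci(vot_scores, text_only_scores)
    print(ci, is_significant(ci))
```

Because the Figure 1 metric divides by input tokens only, a Summary agent can score well on it while still costing more end to end, which is the caveat the caption itself states.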