State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models
Annie Wong, Aske Plaat, Thomas Bäck, Niki van Stein, Anna V. Kononova
TL;DR
State Design Matters investigates how inference-time state representations shape dynamic reasoning in large language models. By manipulating three axes (granularity, structure, and spatial grounding) and evaluating across multiple open-source models and dynamic benchmarks, the study shows that trajectory summarisation often aids long-horizon decisions, that natural language representations are broadly robust, and that Visualization-of-Thought can improve spatial reasoning when models can construct reliable maps. The findings emphasise that how a state is represented, not just what information it conveys, drives performance, and they reveal the brittleness of current LLMs and VLMs over long horizons. The work also identifies conditions under which structured or spatial representations help, and highlights directions for improving reliability, such as verification and improved spatial state tracking. Overall, the paper argues for a shift toward dynamic evaluation to capture agentic competence in evolving environments.
Abstract
As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact with it at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects across sequential decision-making benchmarks: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings). First, we find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings, such as JSON schemas, help mainly for models with strong code or structured-output priors. Third, while image inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of constructing the encoding, which compels the model to perform spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.
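To make the three axes concrete, the following minimal sketch renders one toy state in several ways. It is illustrative only and is not the paper's code: the `State` class, the grid world, and every function name are hypothetical. Note also that the abstract attributes the benefit of textual maps to the model constructing the map itself; here the map is built by the program, purely to show what such an encoding can look like.

```python
import json
from dataclasses import dataclass, field


@dataclass
class State:
    # Hypothetical grid-world state; all fields are illustrative assumptions.
    width: int
    height: int
    agent: tuple[int, int]  # (x, y), origin at the top-left corner
    goal: tuple[int, int]
    history: list[str] = field(default_factory=list)  # past actions, oldest first


def as_natural_language(s: State) -> str:
    """Structure axis: plain-sentence description of the state."""
    return (f"You are at column {s.agent[0]}, row {s.agent[1]} of a "
            f"{s.width}x{s.height} grid. The goal is at column {s.goal[0]}, "
            f"row {s.goal[1]}.")


def as_json(s: State) -> str:
    """Structure axis: symbolic (JSON) encoding of the same facts."""
    return json.dumps({"grid": [s.width, s.height],
                       "agent": list(s.agent),
                       "goal": list(s.goal)})


def as_text_map(s: State) -> str:
    """Spatial-grounding axis: textual map ('A' agent, 'G' goal, '.' empty)."""
    rows = []
    for y in range(s.height):
        row = ""
        for x in range(s.width):
            if (x, y) == s.agent:
                row += "A"
            elif (x, y) == s.goal:
                row += "G"
            else:
                row += "."
        rows.append(row)
    return "\n".join(rows)


def history_long_form(s: State) -> str:
    """Granularity axis: the full trajectory, action by action."""
    return "Actions so far: " + ", ".join(s.history)


def history_summary(s: State, keep_last: int = 2) -> str:
    """Granularity axis: a crude summary (count plus the most recent actions)."""
    recent = s.history[-keep_last:] if keep_last > 0 else []
    return f"{len(s.history)} actions taken; most recent: " + ", ".join(recent)


if __name__ == "__main__":
    state = State(3, 2, (0, 0), (2, 1), ["right", "down", "left", "up"])
    for render in (as_natural_language, as_json, as_text_map,
                   history_long_form, history_summary):
        print(f"--- {render.__name__} ---")
        print(render(state))
```

All five renderings carry overlapping information about the same underlying state. What varies is only the form in which it would be placed in a prompt, which is the variable the study isolates.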
