From Text to Space: Mapping Abstract Spatial Models in LLMs during a Grid-World Navigation Task

Nicolas Martorell

TL;DR

The paper examines how text-based spatial representations shape LLM navigation in a grid-world task, comparing Cartesian, Topographic, and Textual encodings across LLaMA-3 models from $1\mathrm{B}$ to $90\mathrm{B}$ parameters on a $5\times 5$ grid. It finds Cartesian representations yield higher success and path efficiency, with performance scaling with model size and JSON formatting often performing best. Probing reveals mid-layer units that encode agent position and action correctness across representations, including a subset active in unrelated spatial reasoning, indicating an abstract internal spatial model, though ablations show space encoding is distributed and not strictly necessary for task success. These findings advance interpretability of spatial processing in LLMs and offer guidance for designing robust, agentic AI systems that rely on spatial reasoning, while outlining limitations and directions for extending to larger, more complex, and multimodal scenarios.

Abstract

Understanding how large language models (LLMs) represent and reason about spatial information is crucial for building robust agentic systems that can navigate real and simulated environments. In this work, we investigate the influence of different text-based spatial representations on LLM performance and internal activations in a grid-world navigation task. By evaluating models of various sizes on a task that requires navigating toward a goal, we examine how the format used to encode spatial information impacts decision-making. Our experiments reveal that Cartesian representations of space consistently yield higher success rates and path efficiency, with performance scaling markedly with model size. Moreover, probing LLaMA-3.1-8B revealed subsets of internal units, primarily located in intermediate layers, that robustly correlate with spatial features, such as the position of the agent in the grid or action correctness, regardless of how that information is represented, and that are also activated by unrelated spatial reasoning tasks. This work advances our understanding of how LLMs process spatial information and provides valuable insights for developing more interpretable and robust agentic AI systems.

Paper Structure

This paper contains 26 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The Grid-World Spatial Orientation Task (GWSOT). A Goal (G, yellow) is placed in a random position in a 5×5 grid (left-most panels). An Agent (A, blue) is placed in a semi-random location in the same grid, at least two steps away from the Goal. Green arrows (top panel) show a correct path that leads to the goal. Red arrows (bottom panel) show an incorrect path that leads to task failure. The six panels on the right show examples of the three SIR classes and six SIR types used across this study.
  • Figure 2: Model performance improves with model size and is influenced by SIR type. The first three panels show performance metrics in the 5×5 GWSOT for Cartesian (green), Topographic (red) and Textual (blue) SIR classes. In all cases, the gray dashed line shows the performance of a random policy agent. The left-most panel shows the success rate of LLMs in this task as a function of model size. The second panel shows the mean efficiency of LLMs in this task as a function of model size, only for trials where the goal was reached. The third panel shows the mean final distance ratio as a function of model size, only for trials where the goal was not reached. In all cases, the x-axis (model size) is logarithmic and shadings represent standard error. The right-most four panels depict example policy maps for the LLaMA-3.1-8B and the LLaMA-3.2-90B models, for the JSON and Symbol Grid SIRs. Arrows represent the most common action chosen by the model in each relative position to the goal. Brighter red arrows represent more commonly taken actions, whereas darker arrows represent uncertain decisions.
  • Figure 3: Activations in LLaMA-3.1-8B predict grid configuration. The left-most top panel shows the R² of linear models trained on activations from individual layers to predict the full configuration of the 5×5 grid for each of the six SIR types. The second top panel shows the R² of linear models trained on each individual SIR type (rows) and evaluated on every individual SIR type (columns), averaged across models trained on each layer. The color scale is cut off at -15 for visualization purposes. The third top panel shows the same cross-prediction R² measure but evaluated only for models trained on the last layer. The left-most bottom panel shows the number of parameters from each layer that were significantly correlated with the agent’s position being in one specific cell from the 5×5 grid, for each SIR type. The last three bottom panels show the number of units in each layer that were significantly correlated with the agent’s x position, y position or with border cells, for all six SIR types (black curves). Dashed gray lines show the same calculation performed on shuffled parameter indices.
  • Figure 4: Activations in LLaMA-3.1-8B predict action correctness across representation types. The left-most panel compares the total number of parameters per layer that are significantly correlated with action correctness for each SIR type. The second and third panels show the number of common significant parameters across all representation types and across at least 5 of the 6 representation types, respectively (black curves) compared to a shuffle between representations (gray dashed curves). The last panel shows the distribution of correlation coefficients between parameter activation and the spatial nature of a question in an unrelated task, for each parameter identified in the second panel as significantly correlated with action correctness for all SIR types in the GWSOT.
  • Figure 5: Supplementary results from probing analysis. The left-most panel shows six heatmaps displaying the number of units that were significantly correlated with the position of the agent being in a specific grid cell, for each SIR type. Gray cells denote places where no units were significantly correlated with that position. Note that the color scale is logarithmic. The last three panels show the number of parameters per layer significantly correlated with a spatial feature of interest, for each SIR type. The features displayed are whether the agent is located on the grid border, the agent’s x coordinate and the agent’s y coordinate, respectively.
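
The task setup described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' code. It samples a Goal and an Agent on a 5×5 grid and renders the configuration in two illustrative formats loosely modeled on the JSON and Symbol Grid SIRs named in Figure 2. Several details are assumptions made only for this example: "at least two steps away" is read as a Manhattan distance of at least 2, the JSON field names and the `.` empty-cell symbol are invented, and the function names (`sample_configuration`, `render_json`, `render_symbol_grid`) are hypothetical.

```python
import json
import random

GRID_SIZE = 5  # the 5x5 grid from the Figure 1 caption


def manhattan(a, b):
    """Manhattan distance between two (x, y) cells (assumed distance metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sample_configuration(rng, size=GRID_SIZE, min_steps=2):
    """Place the Goal uniformly at random, then place the Agent on a random
    cell that is at least `min_steps` away (assumption: Manhattan steps)."""
    cells = [(x, y) for x in range(size) for y in range(size)]
    goal = rng.choice(cells)
    candidates = [c for c in cells if manhattan(c, goal) >= min_steps]
    agent = rng.choice(candidates)
    return agent, goal


def render_json(agent, goal, size=GRID_SIZE):
    """Cartesian-style rendering; the field names are hypothetical."""
    return json.dumps({
        "grid_size": [size, size],
        "agent": {"x": agent[0], "y": agent[1]},
        "goal": {"x": goal[0], "y": goal[1]},
    })


def render_symbol_grid(agent, goal, size=GRID_SIZE):
    """Topographic-style rendering: 'A' for the Agent, 'G' for the Goal,
    '.' for empty cells (symbols for empty cells are an assumption)."""
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            if (x, y) == agent:
                row.append("A")
            elif (x, y) == goal:
                row.append("G")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)


if __name__ == "__main__":
    agent, goal = sample_configuration(random.Random(0))
    print(render_json(agent, goal))
    print(render_symbol_grid(agent, goal))
```

Rendering the same sampled configuration in both formats mirrors the comparison the figures describe: the underlying grid state is identical and only its textual encoding changes.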