A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Raghu Arghal, Fade Chen, Niall Dalton, Evgenii Kortukov, Calum McNamara, Angelos Nalmpantis, Moksh Nirvaan, Gabriele Sarti, Mario Giulianelli

TL;DR

The paper tackles the challenge of reliably attributing goals to agentic systems by introducing a framework that couples behavioural evaluation with probing of internal representations. It validates the approach on a fully observable grid-world navigation task using an LLM agent, benchmarking the agent against an $A^*$-derived optimal policy and decoding cognitive maps and plans from its internal activations. Key findings are that the agent's goal-directed behaviour scales with task difficulty, that it encodes coarse spatial maps of its position and the goal, and that reasoning reorganises these representations to emphasise immediate action; plan decoding reveals both near-term and longer-horizon planning information. The work provides a practical methodology for goal attribution and interpretability in autonomous AI systems, with implications for safety and monitoring and for extension to diverse architectures and tasks.

Abstract

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing goals to agentic systems. We propose a framework for evaluating goal-directedness that integrates behavioural evaluation with interpretability-based analyses of models' internal representations. As a case study, we examine an LLM agent navigating a 2D grid world toward a goal state. Behaviourally, we evaluate the agent against an optimal policy across varying grid sizes, obstacle densities, and goal structures, finding that performance scales with task difficulty while remaining robust to difficulty-preserving transformations and complex goal structures. We then use probing methods to decode the agent's internal representations of the environment state and its multi-step action plans. We find that the LLM agent non-linearly encodes a coarse spatial map of the environment, preserving approximate task-relevant cues about its position and the goal location; that its actions are broadly consistent with these internal representations; and that reasoning reorganises them, shifting from broader environment structural cues toward information supporting immediate action selection. Our findings support the view that introspective examination is required beyond behavioural evaluations to characterise how agents represent and pursue their objectives.
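The behavioural comparison against an optimal policy can be illustrated with a minimal sketch. It is not the authors' implementation: it uses breadth-first search from the goal instead of $A^*$ (on an unweighted grid both yield the same shortest-path lengths), and the grid encoding (`#` for walls), the action names, and the helper names are all hypothetical.

```python
from collections import deque

# Hypothetical action set and text encoding; the paper's own encoding may differ.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distances_to_goal(grid, goal):
    """BFS from the goal: shortest-path length for every reachable free cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def optimal_actions(pos, dist):
    """Actions that move to a cell one step closer to the goal."""
    r, c = pos
    best = set()
    for name, (dr, dc) in MOVES.items():
        nxt = (r + dr, c + dc)
        # Walls and out-of-bounds cells never appear in `dist`.
        if nxt in dist and dist[nxt] == dist[pos] - 1:
            best.add(name)
    return best


def action_accuracy(grid, goal, trajectory):
    """Fraction of (position, action) steps whose action is optimal."""
    dist = distances_to_goal(grid, goal)
    hits = sum(action in optimal_actions(pos, dist) for pos, action in trajectory)
    return hits / len(trajectory)


grid = ["..#",
        "...",
        "#.G"]
goal = (2, 2)
trajectory = [((0, 0), "right"), ((0, 1), "down"), ((1, 1), "left"),
              ((1, 0), "right"), ((1, 1), "down"), ((2, 1), "right")]
print(action_accuracy(grid, goal, trajectory))  # 5 of the 6 steps are optimal
```

Because several actions can be equally good from a given cell, the sketch scores an action as correct whenever it belongs to the set of optimal actions; whether the paper scores ties this way is an assumption.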

Paper Structure (31 sections, 9 equations, 15 figures, 6 tables)

This paper contains 31 sections, 9 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Overview of our goal-directedness analysis. A: We evaluate how iso-difficulty transforms affect agent trajectories that agree or disagree with the optimal policy. B: We prompt an LLM-based agent to reason and act over the fully observable grid setup, extracting its pre- and post-reasoning activations at intermediate layers. C: We probe the agent's beliefs over goal distance and planned actions, and reconstruct cognitive maps for the current grid state.
  • Figure 2: Grid worlds with increasing wall density $d$, from fully open grids ($d=0$) to maze-like grids with no circular paths ($d=1$).
  • Figure 3: An example grid (left) and its corresponding text-based representation (right) used for LLM prompting.
  • Figure 4: Top: Action accuracy (left) and mean Jensen-Shannon divergence (JSD; right) as a function of the agent's distance from the goal. Bottom: Action accuracy by grid size, complexity, and goal distance.
  • Figure 5: Grid world variants with instrumental and implicit goals. In the text representation, the key and the door are encoded with K and D, and their meaning is explained in the system prompt.
  • ...and 10 more figures
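
The JSD reported in Figure 4 compares two distributions over actions. The sketch below shows one way to compute a Jensen-Shannon divergence between an agent's action distribution and an optimal-policy distribution. How the paper constructs either distribution is not stated here, so the example probabilities, the base-2 logarithm, and the function names are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) with base-2 logs, so the value lies in [0, 1]."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions over (up, down, left, right).
agent_policy = [0.10, 0.60, 0.05, 0.25]
optimal_policy = [0.0, 0.5, 0.0, 0.5]  # uniform over two optimal actions

print(jensen_shannon_divergence(agent_policy, optimal_policy))
print(jensen_shannon_divergence(optimal_policy, optimal_policy))  # 0.0 for identical inputs
```

Unlike the KL divergence, the JSD is symmetric and stays finite when the optimal policy assigns zero probability to some actions, which makes it a convenient choice for this kind of comparison.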