
Understanding and Controlling a Maze-Solving Policy Network

Ulisse Mini, Peli Grietzer, Mrinank Sharma, Austin Meek, Monte MacDiarmid, Alexander Matt Turner

TL;DR

This study probes a pretrained maze-solving policy to uncover how it represents and selects goals, revealing multiple context-dependent objectives and identifying eleven cheese-tracking channels that encode the goal location. Through behavioral statistics and mechanistic analysis, the authors show these goals are represented redundantly and distributed across mid-network activations, enabling activation-engineering interventions. They demonstrate two non-training-based control methods: hand-editing cheese-tracking activations to retarget the policy and combining forward-pass steering vectors to bias behavior, both without retraining. The work advances understanding of goal-direction in policy networks and demonstrates practical avenues for influencing agent behavior via activation-level interventions, with implications for safety and interpretability in AI systems.

Abstract

To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, and we further identify circuits within the network that correspond to one of these goals. In particular, we identified eleven channels that track the location of the goal. By modifying these channels, either with hand-designed interventions or by combining forward passes, we can partially control the policy. We show that this network contains redundant, distributed, and retargetable goal representations, shedding light on the nature of goal-direction in trained policy networks.

Paper Structure (42 sections, 2 equations, 94 figures, 8 tables)


Figures (94)

  • Figure 1: Understanding and controlling a maze-solving policy. (a) We examine a maze-solving policy network that navigates within a maze towards a goal location, marked by cheese. During training, the cheese was placed in the top-right $5\times5$ corner of the maze, the historical goal location. However, during deployment, the cheese may be placed anywhere. The white dot shows a decision square where the policy must choose between navigating to the cheese and navigating to the top-right corner. (b) We identify residual channels whose activations track the location of the cheese. (c) We manually set one of these activations to +5.5. (d) We retarget the policy. Due to the modified activation during the forward pass, the policy goes to the location implied by the edited activation.
  • Figure 2: The policy network pursues multiple goals. During training, the cheese was always in the top right corner of the maze. We show trajectories in four mazes not from the training distribution. In mazes A and B, the policy ignores the cheese and navigates to the historical goal location (the top-right corner). However, in mazes C and D, the agent navigates to the cheese.
  • Figure 3: Network channels track the goal location. We show the activations for channel 55 after the first residual block of the second IMPALA block. The activations of channel 55 are a $16\times 16$ grid. We plot the activation values for the same maze when the cheese is placed in different locations. (b-d) show that channel 55 tracks the cheese. See \ref{appendix:track_cheese_more_examples} for more examples.
  • Figure 4: Resampling cheese-tracking activations from different mazes. (a) Unmodified network behavior. (b) Resampling these activations from other mazes with the same cheese location does not affect the policy's behavior. (c) In contrast, if we replace the activations with those from a maze where the cheese is placed at a different location, the network behaves as if the cheese were at that location. If the cheese-tracking channel activations are resampled from a maze where the cheese is close to the historical cheese location, the policy navigates to the cheese. (d) If the cheese-tracking channel activations are resampled from a maze where the cheese is far from the historical cheese location, the policy ignores the cheese. See \ref{appendix:causal_scrubbing_visual_more_examples} for more examples.
  • Figure 5: Controlling the maze-solving policy by modifying a single activation. By modifying just a single network activation, we control where the policy navigates. We set a single activation in channel 55, one of the cheese-tracking channels, to a large positive value (+5.5; see also Figure 1c). The red dots show the location corresponding to the activation intervention, computed by linearly mapping the $16\times16$ activation grid to the $25\times25$ game grid. (a-c) Successful policy retargeting. This intervention successfully makes the policy navigate to the red dot (the targeted location) and ignore the cheese in the maze. (d) We cannot make the policy navigate to arbitrary maze locations. See \ref{appendix:retargetting_one_channel_more_examples} for more examples.
  • ...and 89 more figures
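
The single-activation intervention described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, the channel is represented as a plain nested list rather than a tensor, and the cell-center convention used for the linear map from the $16\times16$ activation grid to the $25\times25$ game grid is an assumption, since the caption only says the mapping is linear.

```python
import copy

ACT_SIZE = 16   # side length of the channel's activation grid (from the caption)
GRID_SIZE = 25  # side length of the game grid (from the caption)


def activation_to_grid(row, col, act_size=ACT_SIZE, grid_size=GRID_SIZE):
    """Linearly map an activation-grid cell to a game-grid cell.

    Assumption: the center of the activation cell is rescaled and then
    floored. The paper may use a different rounding convention.
    """
    scale = grid_size / act_size
    return (int((row + 0.5) * scale), int((col + 0.5) * scale))


def set_single_activation(channel, row, col, value=5.5):
    """Return a copy of a 2D channel with one entry overwritten.

    `channel` is a list of rows standing in for one channel's activations
    (e.g. channel 55). The default value +5.5 is the one quoted in the caption.
    """
    patched = copy.deepcopy(channel)
    patched[row][col] = value
    return patched


# Hypothetical example: an all-zero channel, edited at activation cell (8, 4).
channel_55 = [[0.0] * ACT_SIZE for _ in range(ACT_SIZE)]
patched_55 = set_single_activation(channel_55, row=8, col=4)
target_cell = activation_to_grid(8, 4)  # game-grid cell implied by the edit
```

In a real network the same overwrite would be applied to the layer's output during the forward pass (for example, inside a forward hook), so that later layers see the edited value. The sketch copies the channel instead of mutating it so the unmodified activations stay available for comparison.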