Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

TL;DR

This work provides a mechanistic interpretation of a model-free Sokoban agent (DRC(3,3)) by showing that future plans are encoded in dedicated path channels within the ConvLSTM hidden states. Planning proceeds via bidirectional plan-extension kernels that act as a learned transition model, enabling forward and backward propagation from boxes and targets, with negative activations pruning unpromising paths. A winner-takes-all mechanism and a causal intervention framework demonstrate that these path-channel activations function as a value-like signal guiding plan survival and action selection. The findings offer an interpretable, weight-level account of planning in a neural agent, with implications for scalability, auditability, and safety in built-to-think systems.

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.
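The bidirectional extension described in the abstract can be illustrated with a toy sketch. This is not the network's learned computation: the one-dimensional corridor, the single "push right" channel, the hand-written shift operations standing in for extension kernels, and the names `shift` and `extend_right_channel` are all assumptions made for illustration. Pruning by negative activations is not modeled here.

```python
import numpy as np


def shift(a, dx):
    """Shift a 1-D array by dx cells, filling vacated cells with zeros."""
    out = np.zeros_like(a)
    if dx > 0:
        out[dx:] = a[:-dx]
    elif dx < 0:
        out[:dx] = a[-dx:]
    else:
        out[:] = a
    return out


def extend_right_channel(box_x, goal_x, width, steps):
    """Toy 'push right' path channel on a corridor of `width` cells.

    A value of 1 at cell x means: a box at x is planned to be pushed right.
    Hypothetical stand-in for the learned kernels, not the paper's weights.
    """
    right = np.zeros(width)
    right[box_x] = 1.0       # forward seed at the box
    right[goal_x - 1] = 1.0  # backward seed: the push that lands on the goal
    for _ in range(steps):
        forward = shift(right, +1)   # a push right moves the box one cell right
        backward = shift(right, -1)  # the cell a push right would arrive from
        forward[goal_x:] = 0.0       # no pushes at or beyond the goal
        backward[:box_x] = 0.0       # no pushes before the box
        right = np.clip(right + forward + backward, 0.0, 1.0)
    return right


if __name__ == "__main__":
    # Box at cell 1, goal at cell 6, corridor of 8 cells.
    for steps in range(3):
        print(steps, extend_right_channel(1, 6, 8, steps))
```

In this toy setting the forward and backward fronts meet after two extension steps, whereas extending from the box alone would need four, which is one intuition for why a bidirectional scheme can be useful.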

Paper Structure

This paper contains 63 sections, 4 equations, 26 figures, 8 tables.

Figures (26)

  • Figure 1: A level with channel activations for a single channel from every group in \ref{tab:group-definitions}. Note the clear activations for the box move channels along the box's path to the target.
  • Figure 2: Left: The $\text{DRC}(3,3)$ architecture. There are three layers of ConvLSTM modules with all the layers repeatedly applied three times before predicting the next action. Right: The ConvLSTM block in the DRC. Note the use of convolutions instead of linear layers, and the last layer of the previous tick ($h_{D}^{n-1}$) as input to the first layer. "Pool" refers to a weighted combination of mean- and max-pooling.
  • Figure 3: An idealized path channels diagram.
  • Figure 4: AUC scores for predicting box and agent movements from the path channels at different numbers of timesteps ahead. Short-term channels have high AUC for up to 10 steps, while long-term channels show a high AUC for predicting actions beyond 10 steps until the end of the episode. The GNA/PNA path channels only exist for the agent, and have high AUC ($\sim$100%) for only the next action.
  • Figure 5: Activations of the long- and short-term channels for all directions when a different direction action takes place at $t=0$. All directions except the up direction show the long-term channel activations decreasing after the other action takes place at $t=0$. The mechanism of this transfer of activation from long to short-term is shown in \ref{fig:transfer-activation}.
  • ...and 21 more figures
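
The application order described in the Figure 2 caption can be sketched as a simple schedule. This is only an illustration of the caption's wording (three layers, three ticks, with the previous tick's top-layer output fed to the first layer); the function name `drc_schedule` and the string labels are hypothetical, and other inputs to each ConvLSTM block (such as the encoded observation or pooled features) are deliberately left out.

```python
def drc_schedule(depth=3, ticks=3):
    """Return (tick, layer, hidden_input) triples in application order.

    hidden_input names the hidden state a layer receives from another layer:
    layer 1 gets the top layer's output from the previous tick (h_D^{n-1}),
    and every other layer gets the output of the layer below at this tick.
    """
    order = []
    for n in range(1, ticks + 1):
        for d in range(1, depth + 1):
            if d == 1:
                hidden_input = f"h{depth}^{n - 1}"  # top layer, previous tick
            else:
                hidden_input = f"h{d - 1}^{n}"      # layer below, same tick
            order.append((n, d, hidden_input))
    return order


if __name__ == "__main__":
    for tick, layer, source in drc_schedule():
        print(f"tick {tick}, layer {layer}: receives {source}")
```

With the default arguments the schedule has nine layer applications before the next action is predicted, matching the three-by-three structure in the caption.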