Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys; George Adamopoulos; Mehrab Hamidi; Stephanie Milani; Mohammad Reza Samsami; Artem Zholus; Sonia Joseph; Blake Richards; Irina Rish; Özgür Şimşek

TL;DR

This study probes the interpretability of VPT, a 250M-parameter Minecraft agent, by applying mechanistic interpretability techniques to a long-horizon task in MineRL. It analyzes attention weights and outputs, saliency maps, and ablations, and it conducts input manipulations and behavioral interventions. Key findings show that VPT preserves task coherence using a short memory window of 128 frames (about 6 s), relying on the most recent frames plus a few key event frames. The study also uncovers a case of goal misgeneralization in which a villager in brown clothes standing under tree leaves is mistaken for a tree trunk and attacked. The work notes its limitations (environmental specificity, seed variation) and argues for scalable, automatic interpretability methods to improve the safety and transparency of vision-based agents.

Abstract

Understanding the mechanisms behind decisions taken by large foundation models in sequential decision-making tasks is critical to ensuring that such systems operate transparently and safely. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft-playing agent, one of the largest open-source vision-based agents. We aim to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task: crafting a diamond pickaxe. The agent pays attention to the last four frames and several key-frames further back in its six-second memory. This is a possible mechanism for maintaining coherence in a task that takes 3-10 minutes, despite the short memory span. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk when the villager is standing still under green tree leaves, and punches it to death.
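The attention finding above can be illustrated with a toy calculation. The sketch below is not the authors' code and does not load VPT. It builds synthetic attention weights over a 128-slot memory, takes the maximum over heads for each slot, and lists the most-attended slots. The 20 Hz frame rate, the number of heads, the key-frame position, and all function names are assumptions made for illustration only.

```python
import math
import random

MEMORY_FRAMES = 128  # memory window stated in the text
FRAME_RATE_HZ = 20   # assumption: frame rate is not stated in this section


def softmax(scores):
    """Turn raw scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_over_heads(weights_per_head):
    """For each memory slot, keep the largest weight any head assigns to it."""
    return [max(column) for column in zip(*weights_per_head)]


def top_slots(weights, k):
    """Indices of the k most-attended memory slots, strongest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


def make_toy_heads(num_heads=4, key_frame=40, seed=0):
    """Synthetic heads (hypothetical data, not VPT activations).

    Head 0 locks onto one earlier key-frame; the other heads favour
    the four most recent slots, mimicking the pattern the text describes.
    """
    rng = random.Random(seed)
    heads = []
    for head_index in range(num_heads):
        scores = [rng.gauss(0.0, 0.1) for _ in range(MEMORY_FRAMES)]
        if head_index == 0:
            scores[key_frame] += 6.0
        else:
            for offset in range(1, 5):
                scores[-offset] += 5.0
        heads.append(softmax(scores))
    return heads


# Memory span in seconds under the assumed frame rate.
memory_seconds = MEMORY_FRAMES / FRAME_RATE_HZ

heads = make_toy_heads()
combined = max_over_heads(heads)
attended = sorted(top_slots(combined, 5))
print(f"memory span: {memory_seconds:.1f} s, most-attended slots: {attended}")
```

Taking the maximum over heads, rather than the mean, keeps a slot visible even when only a single head attends to it, which is why a lone key-frame head still shows up next to the heads that track recent frames.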

Paper Structure (31 sections, 2 equations, 30 figures)

Introduction
Mechanistic interpretability.
Mechanistic interpretability on agents.
Related work
Mechanistic interpretability.
Interpretable reinforcement learning.
Background
Environment: MineRL.
Agent: VPT.
Agent: Steve-1.
Attention mechanism.
Attention visualization
Attention weights
Attention outputs
Interventions and ablations
...and 16 more sections

Figures (30)

Figure 1: We use various interpretability techniques on the Minecraft playing agent VPT to better understand how it makes decisions. These include visualizing attention head weights and outputs, feature visualization, saliency maps, ablating attention head outputs, manipulating the input stream, and others. (Top) a part of a regular episode. (Bottom) an episode with a "villager-tree". See https://youtu.be/g-jd6OyOcUs.
Figure 2: (Middle) Visualization of a trajectory up to crafting a stone pickaxe. The leftmost pixel of each frame corresponds to the time step in the attention plots. (Top) Attention weights of attention head 2.2. Note the different pattern above the 3rd frame, which coincides with the camera moving up. The vertical axis is the 128 attention weights, the horizontal axis is time. (Bottom) Max attention weights over all attention heads. This shows that most attention is paid to the 3-4 most recent frames and some key-frames. See https://youtu.be/BeqSthHRyLA and https://youtu.be/3GhhEysmSY4.
Figure 3: The frame that each attention head is paying the most attention to at a single point in time, right after placing a crafting table. Brightness indicates the magnitude of the attention. For example, heads 0.9 (1st row, 10th frame) and 1.9 are looking at the previous frame; heads 2.13, and 3.13 are looking at the current frame (crafting table placed). Other heads are looking at the inventory menu at different earlier times, some with the recipe book open, and some closed. See https://youtu.be/3GhhEysmSY4.
Figure 4: All 128 output z-scores of attention head 2.2 over the first 200 frames of a regular episode. The different pattern on the right coincides with the agent looking up. Regular vertical lines in the first half match the attacking arm looping every 4 frames. These disappear in the second half. The agent is still attacking, but the arm is replaced by a smaller object: an oak log it just chopped. See https://youtu.be/TbTBWdb6jSo.
Figure 5: Top-down view of trajectories when VPT is presented with a choice between two villagers standing under tree leaves, one on the left, one on the right. The identical scenarios produce different trajectories due to stochastic actions, including one trajectory where the agent turns around and goes in the other direction.
...and 25 more figures
