Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

Abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden-layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse of VLA generalizability. We propose a metric to categorize features according to whether they represent generalizable, transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark, which show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io
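
To make the two core operations in the abstract concrete, the following is a minimal sketch of a sparse autoencoder over hidden activations and of steering along a single feature. It is not the authors' implementation: the ReLU encoder with an L1 sparsity penalty, the unit-norm decoder rows, and additive steering along a decoder direction are common choices assumed here for illustration. The names `SparseAutoencoder`, `steer`, `l1_coeff`, and `alpha` are hypothetical, and the random activations stand in for real VLA hidden states.

```python
import numpy as np


class SparseAutoencoder:
    """Minimal ReLU sparse autoencoder (assumed form, not the paper's exact model)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        # Normalize decoder rows so each feature is a unit direction in activation space.
        W_dec = rng.normal(0.0, 0.1, size=(n_features, d_model))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # Non-negative feature activations; training pushes most of them to zero.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruct activations as a weighted sum of decoder directions.
        return f @ self.W_dec + self.b_dec

    def loss(self, x, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that encourages sparsity.
        f = self.encode(x)
        mse = np.mean(np.sum((x - self.decode(f)) ** 2, axis=-1))
        sparsity = np.mean(np.sum(np.abs(f), axis=-1))
        return mse + l1_coeff * sparsity


def steer(x, sae, feature_idx, alpha):
    """Shift activations by alpha times one feature's decoder direction (assumed scheme)."""
    return x + alpha * sae.W_dec[feature_idx]


# Stand-in data: 8 activation vectors of width 16 (real VLA layers are much wider).
sae = SparseAutoencoder(d_model=16, n_features=64)
acts = np.random.default_rng(1).normal(size=(8, 16))
features = sae.encode(acts)
steered = steer(acts, sae, feature_idx=3, alpha=4.0)
```

In a real pipeline the steered activations would be written back into the model's forward pass at the hooked layer, and the SAE weights would come from training rather than random initialization.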

Paper Structure (49 sections, 13 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Activations for two general features related to grasping in the $\pi_{0.5}$ model's PaliGemma Layer 5 over an episode of LIBERO. These features activate generally for grasping and carrying behaviors across the dataset.
  • Figure 2: Overview of our mechanistic interpretability pipeline for VLA models. We collect internal activations from a VLA, train a Sparse Autoencoder (SAE), and obtain sparse features that represent both memorized episodes and general semantic concepts, motion primitives, and task structure. We propose a metric to categorize features by generality. Right: three grasp-related features on DROID, each visualized by its top four activating episodes. Memorization features fire on highly similar scenes, while general features activate across diverse scenes, tasks, and grasp types.
  • Figure 3: We present four general features identified across a range of LIBERO episodes. For each task, we show the wrist and main camera images equally spaced between $t_0$ and $t_f$. Below these images, we show the activations of the four general features over the episode. The feature activations and camera images are in chronological order.
  • Figure 4: DROID general features across diverse tasks. For each episode, we show evenly spaced frames from the main and wrist cameras from $t_0$ to $t_f$, with per-timestep feature activations plotted below. Rows correspond to example tasks, and the color bars map to features F158, F586, F165, and F399 (legend at bottom).
  • Figure 5: Closed-loop steering results for F128 of $\pi_{0.5}$ LIBERO. We show XYZ trajectory plots of the unsteered and steered trajectories (top) and images across timesteps during steering (bottom).
  • ...and 7 more figures
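
The Figure 2 caption refers to a metric that categorizes features by generality, but its definition is not given on this page. As a purely illustrative stand-in, the sketch below scores a feature by how evenly its activation mass is spread across episodes, using normalized entropy. The function name `episode_generality` and the entropy-based formula are assumptions, not the paper's metric.

```python
import numpy as np


def episode_generality(activations, episode_ids):
    """Normalized entropy of one feature's activation mass across episodes.

    Hypothetical score, not the paper's metric: 0 means all mass sits in a
    single episode (memorization-like); 1 means mass is spread evenly over
    all episodes (general-like).
    """
    activations = np.asarray(activations, dtype=float)
    episode_ids = np.asarray(episode_ids)
    episodes = np.unique(episode_ids)
    # Total activation of the feature within each episode.
    mass = np.array([activations[episode_ids == e].sum() for e in episodes])
    total = mass.sum()
    if total <= 0 or len(episodes) < 2:
        return 0.0
    p = mass / total
    p = p[p > 0]  # drop empty episodes so log(p) is defined
    entropy = -np.sum(p * np.log(p))
    return float(max(0.0, entropy / np.log(len(episodes))))


# Toy data: six timesteps drawn from three episodes.
ids = [0, 0, 1, 1, 2, 2]
memorized_score = episode_generality([1.0, 2.0, 0.0, 0.0, 0.0, 0.0], ids)
general_score = episode_generality([1.0, 1.0, 1.0, 1.0, 1.0, 1.0], ids)
```

A feature that fires in only one episode gets a lower score than one that fires evenly across episodes, which mirrors the qualitative distinction the caption draws between memorization and general features.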