Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Bryce Grant; Xijia Zhao; Peng Wang

Abstract

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M--7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8\% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is ignored; when multiple goals share a scene, language becomes essential (X-VLA \texttt{libero\_goal}: 94\%$\to$10\% under wrong prompts vs.\ \texttt{libero\_object}: 60--100\% regardless). In all three multi-pathway architectures ($\pi_{0.5}$, SmolVLA, GR00T), expert pathways encode motor programs while VLM pathways encode goal semantics ($2\times$ greater behavioral displacement from expert injection), and subspace injection confirms that these occupy separable activation subspaces. Per-token SAE processing is essential for action fidelity on most architectures, though mean-pooling improves fidelity on X-VLA. Contrastive identification recovers 82+ manipulation concepts, and causal ablation yields zero-effect rates ranging from 28\% to 92\% across models, a spread that does not track representation width. We release \textbf{Action Atlas} (https://action-atlas.com) for interactive exploration of VLA representations across all six models.

Paper Structure (70 sections, 4 equations, 23 figures, 29 tables)

Introduction
Related Work
Method
VLA Architectures Under Study
Activation Injection
Counterfactual Prompting
Sparse Autoencoders for VLAs
Per-Token Processing.
Feature Identification.
Scale.
Linear Probes for Action Prediction
Metrics
Experiments
Experimental Setup
Benchmarks and Scale.
...and 55 more sections

Figures (23)

Figure 1: Three core findings on $\pi_{0.5}$. Left: Activation injection recovers baseline behavior from null-prompt episodes. Without injection, null prompts drop cosine similarity to 0.775; injecting a single layer (L0) recovers 0.997 and all layers recovers 0.999, demonstrating visual pathway dominance. Middle: Per-token SAE processing is essential. Mean-pooled SAE reconstruction destroys task success (96%$\to$8%) despite high explained variance, while per-token processing preserves performance (96%$\to$94%). Right: Cross-task injection fails destination tasks (83.3%$\to$2.2%) and same-scene injection partially succeeds (35.0%), confirming spatially bound motor programs. These patterns replicate across all six models (Table \ref{tab:cross-model}).
Figure 2: Methodology overview. Top: activations are recorded from VLA backbone and action expert layers during rollout episodes, then replayed under counterfactual conditions (null prompts, cross-task scenes) to establish causal relationships via behavioral change. Middle: per-token SAEs decompose layer activations into sparse features. Bottom: features are clustered, searched, and causally validated through ablation and steering experiments, with results visualized in Action Atlas.
Figure 3: Cross-task displacement override rates. Left: override rate across five models. $\pi_{0.5}$ (99.6%, $n{=}1{,}968$) and X-VLA (99.8%, $n{=}3{,}150$) show near-complete source behavior transfer; OFT 77.9% ($n{=}1{,}079$); GR00T 57.0% ($n{=}270$, suite-dependent: goal 85.6%, long 33.3%). Error bars: 95% Wilson CIs. Right: SmolVLA pathway displacement (15.8% expert vs. 9.0% VLM, 732 pairs).
Figure 4: Concept ablation causal sensitivity across five models. Each bar shows the fraction of concept-task pairs with zero effect (gray), partial effect (blue), and total destruction ($-100$pp, red) under single-feature ablation. SmolVLA (480-dim expert) is the most sensitive at 28% zero-effect rate; OFT (4096-dim) and X-VLA (1024-dim) are the most resilient at 92% and 82% respectively. Causal sensitivity does not follow representation width: X-VLA approaches OFT despite sharing $\pi_{0.5}$'s 1024-dim hidden size.
Figure 5: PUT concept ablation (L8): "Put the cream cheese in the bowl." Top (green): Baseline. The robot picks up the cream cheese and places it in the bowl (91 steps). Bottom (red): With PUT features zeroed at layer 8, the robot drops the cream cheese into the bowl, knocking it over (300 steps, task failure).
...and 18 more figures
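
The Figure 3 caption reports 95% Wilson confidence intervals on override rates. As a reading aid, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is not taken from the paper's code: the function name `wilson_interval` is hypothetical, and the success count of 154 is an assumption chosen only because 154/270 is roughly the 57.0% rate quoted for GR00T.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")

    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # Center is pulled toward 0.5 relative to the raw proportion.
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z2 / (4 * n * n)
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative only: 154 successes out of 270 is an assumed count that
# approximates the 57.0% GR00T override rate; it is not a reported number.
low, high = wilson_interval(154, 270)
print(f"override rate {154 / 270:.3f}, 95% Wilson CI [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for rates near 100%, which matters for values such as the 99.6% and 99.8% override rates in the same figure.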