
Sparse Visual Thought Circuits in Vision-Language Models

Yunpeng Zhou

Abstract

Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways in which feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear probes identify a mid-decoder locus for task-type information. We train SAEs at this layer, construct task-selective sets via an explicit rule, and perform inference-time scaling and ablation while quantifying accuracy and drift. Our findings, validated with bootstrapped subsamples and permutation controls and replicated across multiple VLM families and five diverse datasets, clarify the boundaries of SAE feature composability and provide a rigorous diagnostic framework for more reliable VLM control.

Paper Structure

This paper contains 108 sections, 21 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Sparse Visual Thought Circuits (SVTC) in a frozen vision--language model. We (1) localize reasoning-related signals by probing pooled hidden states across layer groups for task_type (7-way) and difficulty (3-way), selecting a predictive site (e.g., decoder $M_{\text{middle\_2}}$); (2) train a sparse autoencoder at that site to map activations $h\in\mathbb{R}^{4096}$ to sparse codes $z\in\mathbb{R}^{32768}$ and construct task-selective feature sets (Pattern, Global, and their union) via rule-based selectivity; and (3) perform inference-time interventions by modifying selected coordinates of $z$ (scaling/ablation), decoding a reconstruction, and applying the resulting update to obtain $h'$. We report accuracy change $\Delta\mathrm{Acc}$ (pp), output drift Changed (%), and perturbation magnitude $\mathbb{E}[\lVert \Delta\rVert/\lVert h\rVert]$, with negative controls (random and permutation sets), bootstrap over subsampling seeds, and held-out validation on val/test.
  • Figure 2: The Compositional Paradox: From Circuit Discovery to Mechanical Failure. (A) Circuit Localization: Layerwise probing reveals a clear phase transition: visual reasoning capability ($l^\star$) emerges and peaks at Layer 21 (99.2% accuracy), distinct from the early Perception stage and the late Output saturation. (B) The Paradox: Across five benchmarks, while individual Pattern steering (circles) generally yields positive gains, the joint Union intervention (squares) consistently drives the model into the "Collapse Regime" (shaded pink), characterized by high output drift and accuracy degradation. (C) Mechanism Validation: Fine-grained analysis on the hypersensitive Layer 10 reveals a non-monotonic U-shaped recovery. Weak signals ($s < 0.5$) act as adversarial noise causing collapse, confirming that a critical signal-to-noise threshold must be overcome for effective steering.
  • Figure 3: Mechanism of Compositional Collapse. (A) Geometric Antagonism: Although individual Pattern and Global vectors are strong, their antagonistic alignment ($\theta > 90^\circ$) causes the Union vector to shrink significantly inside the unit circle (Signal Collapse). (B) Noise Amplification: This reduced signal magnitude $\|\delta\|$ pushes the system into the hypersensitive "Collapse Regime," where the layer normalization operator forces an exponential amplification of latent noise.
  • Figure A1: Spatial Grounding of Visual Thought. Activation heatmap of Feature #2398 from the Pattern Set. Despite being discovered via global pooling, the feature functions as a precise semantic detector, selectively activating (red) on the target visual objects while ignoring the background (blue).
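The intervention loop summarized in Figure 1 (encode the hidden state into sparse codes, scale or ablate the selected coordinates, decode, and apply the reconstruction difference as an additive update) can be sketched in a few lines. This is a minimal illustrative sketch with randomly initialized toy SAE weights and hypothetical feature-index sets; the actual trained SAE, the selectivity rule, and the paper's feature sets are not reproduced here.

```python
import numpy as np

# Toy dimensions matching the caption: h in R^4096, z in R^32768.
rng = np.random.default_rng(0)
d_model, d_sae = 4096, 32768

# Hypothetical untied SAE: z = relu(W_e h + b_e), h_hat = W_d z + b_d.
# Real weights would come from training at the selected decoder layer.
W_e = rng.normal(0.0, 0.01, (d_sae, d_model))
b_e = np.zeros(d_sae)
W_d = rng.normal(0.0, 0.01, (d_model, d_sae))
b_d = np.zeros(d_model)

def encode(h):
    return np.maximum(W_e @ h + b_e, 0.0)

def decode(z):
    return W_d @ z + b_d

def intervene(h, feature_idx, scale):
    """Scale selected SAE coordinates and apply the resulting
    reconstruction difference as an additive update to h."""
    z = encode(h)
    z_mod = z.copy()
    z_mod[feature_idx] *= scale           # scale > 1: steering; scale = 0: ablation
    delta = decode(z_mod) - decode(z)     # update induced by the feature edit
    h_prime = h + delta
    rel_norm = np.linalg.norm(delta) / np.linalg.norm(h)  # E[||Delta||/||h||] term
    return h_prime, rel_norm

# Hypothetical Pattern / Global sets and their union (indices are random here;
# the paper selects them via a rule-based selectivity criterion).
h = rng.normal(0.0, 1.0, d_model)
pattern_set = rng.choice(d_sae, 32, replace=False)
global_set = rng.choice(d_sae, 32, replace=False)
union_set = np.union1d(pattern_set, global_set)

_, r_pattern = intervene(h, pattern_set, 2.0)
_, r_union = intervene(h, union_set, 2.0)
print(f"relative perturbation -- Pattern: {r_pattern:.4f}, Union: {r_union:.4f}")
```

Comparing `r_pattern` with `r_union` under a fixed scale mirrors the norm-matched comparison in the paper: if the Pattern and Global update directions are antagonistically aligned (Figure 3A), the union's signal can shrink even as the edited coordinate count grows.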