Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking

Jingcheng Yang; Tianhu Xiong; Shengyi Qian; Klara Nahrstedt; Mingyuan Wu

TL;DR

This work introduces the first framework for transparent circuit tracing in vision-language models (VLMs), enabling systematic analysis of multimodal reasoning, and reveals that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations.

Abstract

Vision-language models (VLMs) are powerful but remain opaque black boxes. We introduce the first framework for transparent circuit tracing in VLMs to systematically analyze multimodal reasoning. By utilizing transcoders, attribution graphs, and attention-based methods, we uncover how VLMs hierarchically integrate visual and semantic concepts. We reveal that distinct visual feature circuits can handle mathematical reasoning and support cross-modal associations. Validated through feature steering and circuit patching, our framework proves these circuits are causal and controllable, laying the groundwork for more explainable and reliable VLMs.

Paper Structure

This paper contains 27 sections, 11 equations, 13 figures, and 1 table.

Introduction
Related Work
Circuit Tracing in VLMs
Transcoders
Attribution Graphs
Feature Interpretation and Attention Analysis
Circuit Discovery
Intervention and Steering
Experiments
Training Transcoders
Computing Attribution Graphs
Feature Analysis
Finding Circuits
Intervention Experiments
Empirical Findings
...and 12 more sections

Figures (13)

Figure 1: Given an image and a prompt, how can we extract a circuit, i.e., an internal computation graph, from an open-source vision-language model such as Google's Gemma3-4B [gemma3]? We introduce the first framework for successful circuit tracing in VLMs, enabling analysis of the internal circuits and concept associations that underlie multimodal reasoning.
Figure 2: A summary of our method. We first train transcoders for Gemma-3-4b-it on our curated dataset, yielding a replacement model with monosemantic features. We then use feature activation analysis to obtain interpretable information about the features. Finally, we generate an attribution graph for a given prompt and rely on human experts to derive the final circuit.
Figure 3: Transcoder vs. SAE. SAEs learn to reconstruct model activations, whereas transcoders imitate the input-output behavior of MLP sublayers.
Figure 4: Top: Percentage of dead latents (Dead PCT) across layers for different values of $N_{\mathrm{latents}}$. Bottom: Fraction of variance unexplained (FVU) when training transcoders on a text-only dataset compared to our multimodal split.
Figure 5: Feature activations on images with a mix of feature annotations resulting from full images (right) and feature analysis resulting from noisy attention maps (left).
...and 8 more figures
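
The distinction in Figure 3 and the two metrics in Figure 4 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy MLP, the random untrained weights, the helper names (`encode`, `decode`, `fvu`, `dead_pct`), and the exact metric definitions are assumptions chosen for illustration. It shows only that an SAE is scored against the same activations it receives, a transcoder is scored against the MLP sublayer's output, and how FVU and the dead-latent percentage are commonly computed.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(x, W_enc, b_enc):
    """Sparse latent activations, shape (n_samples, n_latents)."""
    return relu(x @ W_enc + b_enc)

def decode(z, W_dec, b_dec):
    """Linear readout from latents back to model dimension."""
    return z @ W_dec + b_dec

def fvu(target, prediction):
    """Fraction of variance unexplained (assumed definition):
    residual sum of squares divided by the target's total variance."""
    residual = np.sum((target - prediction) ** 2)
    total = np.sum((target - target.mean(axis=0)) ** 2)
    return residual / total

def dead_pct(latents):
    """Percentage of latents that never activate on this batch
    (assumed definition of a 'dead' latent)."""
    return 100.0 * np.mean(np.all(latents == 0.0, axis=0))

rng = np.random.default_rng(0)
n, d_model, d_hidden, n_latents = 256, 16, 32, 64

# Hypothetical frozen MLP sublayer standing in for one VLM layer.
W1 = rng.normal(size=(d_model, d_hidden))
W2 = rng.normal(size=(d_hidden, d_model))
mlp_in = rng.normal(size=(n, d_model))
mlp_out = relu(mlp_in @ W1) @ W2

# Untrained dictionary weights; a real run would optimize these.
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
b_dec = np.zeros(d_model)

z = encode(mlp_in, W_enc, b_enc)
prediction = decode(z, W_dec, b_dec)

# SAE objective: reconstruct the activations that were fed in.
sae_fvu = fvu(mlp_in, prediction)
# Transcoder objective: predict the MLP output from the MLP input.
transcoder_fvu = fvu(mlp_out, prediction)

print(f"SAE-style FVU:        {sae_fvu:.3f}")
print(f"Transcoder-style FVU: {transcoder_fvu:.3f}")
print(f"Dead latents:         {dead_pct(z):.1f}%")
```

Under these definitions, an FVU of 0 means the target is predicted exactly, and an FVU of 1 matches the error of always predicting the per-dimension mean of the target.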
