CircuitProbe: Tracing Visual Temporal Evidence Flow in Video Language Models

Yiming Zhang, Zhuokai Zhao, Chengzhang Yu, Kun Wang, Zhendong Chu, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun, Qingsong Wen

Abstract

Autoregressive large vision-language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space as continuous visual token embeddings. However, it remains unclear where temporal evidence is represented and how it causally influences decoding. To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages: (i) Visual Auditing, which localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations and controlled substitutions; and (ii) Semantic Tracing, which uses logit-lens probing to track the layer-wise emergence of object and temporal concepts, augmented with temporal frame interventions to assess sensitivity to temporal structure. Based on the resulting analysis, we design a targeted surgical intervention that strictly follows our observations: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval revealed by Semantic Tracing. This analysis-driven intervention yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark, validating the correctness, effectiveness, and practical value of the proposed circuit-level analysis for temporal understanding in LVLMs.
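
The abstract describes the intervention only at a high level: selected attention heads are amplified inside a critical layer interval. As a rough illustration, and not the authors' implementation, the following sketch scales the per-head outputs of a toy model by a constant factor. The function name `amplify_heads`, the factor `alpha`, the chosen head set, and the layer interval are all hypothetical; the abstract does not state how heads are scored or how strongly they are amplified.

```python
def amplify_heads(head_outputs, selected, layer_range, alpha):
    """Return a copy of head_outputs with selected heads scaled by alpha.

    head_outputs[layer][head] is a plain list of floats (one head's output).
    selected is a set of (layer, head) pairs; only pairs whose layer falls
    inside the inclusive layer_range are actually amplified.
    All names and values here are illustrative assumptions.
    """
    lo, hi = layer_range
    scaled = []
    for layer_idx, heads in enumerate(head_outputs):
        new_heads = []
        for head_idx, vec in enumerate(heads):
            in_range = lo <= layer_idx <= hi
            is_selected = (layer_idx, head_idx) in selected
            factor = alpha if (in_range and is_selected) else 1.0
            new_heads.append([factor * x for x in vec])
        scaled.append(new_heads)
    return scaled


# Toy setup: 3 layers, 2 heads per layer, 2-dimensional head outputs.
head_outputs = [[[1.0, 2.0], [1.0, 2.0]] for _ in range(3)]
selected = {(0, 1), (1, 0), (2, 1)}   # hypothetical "temporal" heads
amplified = amplify_heads(head_outputs, selected, layer_range=(1, 2), alpha=1.5)
```

In this toy run, head (0, 1) is in the selected set but lies outside the layer interval, so it is left unchanged, while heads (1, 0) and (2, 1) are scaled. In a real model the scaling would typically be applied to each head's contribution before the attention output projection.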

Paper Structure

This paper contains 13 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our Tracing Circuit Framework. We systematically analyze LVLMs by decomposing the circuit's information flow into three modules: ➀ visual auditing, ➁ semantic tracing, and ➂ attention flow.
  • Figure 2: Overview of the circuit experiments. Our methodology comprises three key interventions: ➀ strategically modifying specific visual token subsets via ablation and text injection; ➁ tracing the semantic evolution of tokens across layers using logit lens; and ➂ masking attention pathways to analyze information flow within the LVLM.
  • Figure 3: Results of the text injection experiment. The result shows the performance change when visual object tokens are replaced by their corresponding embedded textual labels. This direct injection of symbolic information significantly improves performance, often surpassing the original baseline model, which underscores the potent effect of clean semantic signals.
  • Figure 4: Quantitative analysis of semantic tracing. Both metrics show a sharp increase in the mid-to-late layers, indicating that abstract semantic concepts are consolidated deep within the network.
  • Figure 5: Qualitative example of semantic tracing illustrating how the model captures temporal dynamics. The sequence from (a) to (d) shows the evolution of the top-3 most frequent word groups for a single token position. The predictions shift from semantics related to an initial state (e.g., sitting) to those reflecting the completed action (e.g., standing), highlighting the model's ability to track actions over time.
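
The semantic tracing in Figures 4 and 5 relies on logit-lens probing: the hidden state at a token position is read out at each layer through the model's unembedding matrix, which yields a per-layer distribution over vocabulary words. The sketch below is a minimal pure-Python illustration of that idea on made-up numbers, not the paper's code. The three-word vocabulary, the unembedding rows, and the hidden states are invented for the example, and the final normalization layer that a real model would usually apply before unembedding is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed, vocab, top_k=3):
    """Project each layer's hidden state onto the vocabulary.

    hidden_states: one vector per layer for a single token position.
    unembed: one row per vocabulary word, same width as the hidden state.
    Returns, per layer, the top_k (word, probability) pairs.
    """
    trace = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        ranked = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=True)
        trace.append(ranked[:top_k])
    return trace


# Toy values (assumptions): the hidden state drifts from the "sitting"
# direction toward the "standing" direction as depth increases.
vocab = ["sitting", "standing", "chair"]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden_states = [[2.0, 0.0], [1.5, 1.0], [0.0, 2.0]]

trace = logit_lens(hidden_states, unembed, vocab, top_k=1)
top_words = [layer[0][0] for layer in trace]
```

Here `top_words` records the most likely word at each layer, which mirrors the kind of shift from an initial-state word to a completed-action word that Figure 5 reports qualitatively.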