Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che; Zhiyu Xue; Yihao Quan; Benlin Liu; Zeru Shi; Michelle Hurst; Jacob Feldman; Ruixiang Tang; Ranjay Krishna; Vladimir Pavlovic

Abstract

Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning: it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

Paper Structure (54 sections, 18 equations, 19 figures, 11 tables)

Introduction
Related Work
LVLMs for counting task.
Mechanistic Interpretability on LLM/LVLM.
Problem Formulation
Do LVLMs know how to count? Or do they memorize?
Understand Layerwise Behavior of Counting
Information Flow and Cross-Modal Routing
Mechanistic Analysis on Attention Head Function
Important Attention Heads for Counting
HeadLens: Decoding Individual Attention Heads
Revealing Attention Head Functionalities
Enhancing LVLMs' counting ability
Object-Focused Attention Regularizer
Adaptive Head Temperature Tuning
...and 39 more sections

Figures (19)

Figure 1: Overview of Main Contributions
Figure 2: Counting Uncertainty Curve by Yes/No Answer of Qwen2.5-VL 7B.
Figure 3: t-SNE visualization of model hidden states on the black-dots data.
Figure 4: Left: Layer-wise Overwrite Rate for Different Input-Token Patching Strategies; Right: Layer-wise Logit Lens Tracking Curve for Counting Number.
Figure 5: Visualization of two functional groups of counting heads. Left to right: Head importance (positive-only), attention ratio on image tokens, and top-10 HeadLens CTER across heads. We present the typical attention distribution and top-10 HeadLens results for Cross-modal Routing Heads (green boxes) and Counting Aggregation Heads (blue boxes), using L19H23 (left) and L26H26 (right). Both have the Chinese token for "three" (三) as the top-1 HeadLens token.
...and 14 more figures
