ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering

Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu

TL;DR

ConFoThinking is a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which it mines and zooms in on salient regions for downstream visual understanding.

Abstract

Thinking with Images improves fine-grained VQA for MLLMs by emphasizing visual cues. However, tool-augmented methods depend on grounding capacity, which remains unreliable for MLLMs. In parallel, attention-driven methods that crop Regions of Interest (ROIs) have been proposed, but they are constrained by (1) fragmented attention signals scattered across layers, which lead to suboptimal localization, and (2) reliance on question- or redundant-text-conditioned attention extraction. Our analysis reveals three patterns: MLLMs may attend to the correct region yet generate incorrect coordinates, where-to-look attention is often fragmented across layers, and attention extraction is query-sensitive. Motivated by these findings, we propose ConFoThinking, a Consolidated-Focused-Attention-Driven Thinking framework that learns to aggregate attention into a designated intermediate layer, from which we mine and zoom in on salient regions for downstream visual understanding. Moreover, we extract attention using concise semantic cues of what to look into, which mitigates the semantic noise introduced by question- or redundant-text-based attention extraction. Experiments across five VQA benchmarks demonstrate that ConFoThinking significantly improves perception performance. The code, checkpoints, and dataset will be released upon acceptance.
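
The "mine and zoom in" step described in the abstract can be illustrated with a small sketch. This is not the authors' method: the paper trains a detector to regress ROI boxes from the attention heatmap, whereas the hypothetical `heatmap_to_box` below simply thresholds the heatmap at a fraction of its peak. The function names, the `keep_ratio` parameter, and the toy heatmap are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def heatmap_to_box(heatmap: List[List[float]], keep_ratio: float = 0.5) -> Box:
    """Bounding box of all cells with value >= keep_ratio * peak.

    Assumption: a simple threshold stands in for a learned box regressor.
    """
    peak = max(max(row) for row in heatmap)
    thresh = keep_ratio * peak
    rows = [r for r, row in enumerate(heatmap) if any(v >= thresh for v in row)]
    cols = [c for c in range(len(heatmap[0]))
            if any(row[c] >= thresh for row in heatmap)]
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def scale_box(box: Box, grid_hw: Tuple[int, int], image_hw: Tuple[int, int]) -> Box:
    """Map a box from heatmap-grid cells to pixel coordinates for cropping."""
    gh, gw = grid_hw
    ih, iw = image_hw
    x0, y0, x1, y1 = box
    return (x0 * iw // gw, y0 * ih // gh, x1 * iw // gw, y1 * ih // gh)


# Toy 4x4 heatmap with attention concentrated in the centre-right cells.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.8, 0.9, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
grid_box = heatmap_to_box(heatmap)                    # box in grid cells
pixel_box = scale_box(grid_box, (4, 4), (400, 800))   # box in a 400x800 image
print(grid_box, pixel_box)
```

The resulting pixel box would then be cropped and enlarged, and the zoomed view passed to the model alongside the original image.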

Paper Structure (51 sections, 9 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Limitations of existing methods. (A) In tool-augmented coordinate-output pipelines, MLLMs may "say" incorrect bounding-box coordinates even though their intermediate vision-language fusion layers still attend to the ground-truth (GT) ROI. The line chart shows the attention distribution inside the predicted and GT boxes for Qwen3-VL-8B (with tools) on the V* benchmark (wu2024v), restricted to cases where the model outputs an incorrect answer and the predicted box has $\mathrm{IoU} < 0.1$ with the GT box. (B) In attention-driven methods, where-to-look signals are fragmented across layers, making any fixed-layer choice unreliable. The bar chart shows the distribution of the highest-ROI-attention layer for Qwen3-VL-8B across samples in VisCoT; only 19.3% of samples fall at a fixed single layer (layer 22). (C) Where-to-look signals are query-sensitive: extracting attention from semantic visual cues is more accurate than extracting it from the raw question. The line chart compares the layer-wise attention distributions of these two approaches on the V* benchmark.
  • Figure 2: ConFoThinking overview. (A) Training ConFoAttn to produce a <FOCUS>...</FOCUS> span and fixed-layer attention heatmaps. (B) Training AttnDetector to regress ROI boxes from attention heatmaps. (C) Inference pipeline: generate heatmap, localize and zoom the ROI, then answer with the base MLLM using both the original and zoomed images.
  • Figure 3: Comparison for attention condensation.
  • Figure 4: Layer-wise comparison of attention concentration on $R_{\text{GT}}$ vs. $R_{\text{pred}}$ for Pixel-Reasoner over grounding-error cases ($\mathrm{IoU}<0.1$).
  • Figure 5: Distribution of the peak-attention layer $\ell^\star$ on the VisCoT validation set. Left: Qwen3-VL-4B (mode at layer 22). Right: Qwen2.5-VL-7B (mode at layer 20). Although a modal layer exists, the distribution remains broadly dispersed across layers.
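
The quantities named in these captions (box IoU, attention concentration inside a region, and the peak-attention layer $\ell^\star$) can be sketched as follows. The exact definitions are not given in this excerpt, so this is one plausible reading: attention concentration is taken as the share of a layer's attention mass inside a box, and $\ell^\star$ as the layer that maximizes that share. All function names and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def attention_in_box(attn: List[List[float]], box: Box) -> float:
    """Share of the total attention mass that lies inside the box (assumed metric)."""
    total = sum(sum(row) for row in attn)
    inside = sum(attn[r][c]
                 for r in range(box[1], box[3])
                 for c in range(box[0], box[2]))
    return inside / total if total > 0 else 0.0


def peak_layer(layer_maps: Sequence[List[List[float]]], box: Box) -> int:
    """Index of the layer whose attention is most concentrated in the box."""
    scores = [attention_in_box(m, box) for m in layer_maps]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: a GT box, a disjoint predicted box, and three layer heatmaps.
gt, pred = (0, 0, 2, 2), (2, 2, 4, 4)
layers = [
    [[1.0] * 4 for _ in range(4)],                      # uniform attention
    [[4, 4, 0, 0], [4, 4, 0, 0], [0] * 4, [0] * 4],     # focused on GT box
    [[0] * 4, [0] * 4, [0, 0, 4, 4], [0, 0, 4, 4]],     # focused on pred box
]
print(iou(gt, pred), attention_in_box(layers[0], gt), peak_layer(layers, gt))
```

Collecting `peak_layer` over many samples and plotting a histogram would give a distribution of the kind Figure 5 describes; comparing `attention_in_box` for the GT and predicted boxes layer by layer corresponds to the comparison in Figure 4.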