Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

Isha Chaturvedi, Anjana Nair, Yushen Li, Adhitya Rajendra Kumar, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma

TL;DR

This work introduces Contrastive Region Masking (CRM), a training-free diagnostic that causally links each step of a multimodal LLM's chain-of-thought to specific visual regions by masking those regions and comparing the resulting reasoning traces. Evaluated on the VisArgs dataset, CRM uncovers distinct failure modes across state-of-the-art models: some maintain reasoning structure yet hallucinate when evidence is missing, while others ground to visual cues but falter under perturbations. The approach shifts evaluation from final-answer accuracy to the faithfulness and robustness of intermediate reasoning, framing visual benchmarks as diagnostic tools for interpretability and reliability. Because it provides step-level attribution without retraining, CRM can be applied to existing models as they are.

Abstract

We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance but also robustness and fidelity of reasoning.
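The mask-and-contrast loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the helpers `mask_region`, `contrast_traces`, and `crm`, the position-wise `difflib` similarity with a 0.6 threshold, the grayscale nested-list image format, and the `toy_model` stub that stands in for a real MLLM call. The section does not state which comparison metric the paper uses.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]            # grayscale pixels, row-major (assumed format)
Box = Tuple[int, int, int, int]    # (top, left, bottom, right), end-exclusive
Model = Callable[[Image, str], Tuple[List[str], str]]  # returns (CoT steps, answer)


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the annotated box overwritten by `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def contrast_traces(original: List[str], masked: List[str],
                    threshold: float = 0.6) -> List[int]:
    """Indices of original steps whose same-position masked step differs.

    A missing masked step is compared against the empty string, so it
    always counts as disrupted. The threshold is an arbitrary choice.
    """
    disrupted = []
    for i, step in enumerate(original):
        other = masked[i] if i < len(masked) else ""
        if SequenceMatcher(None, step, other).ratio() < threshold:
            disrupted.append(i)
    return disrupted


def crm(image: Image, box: Box, question: str, model: Model) -> Dict[str, object]:
    """Run the model on the original and masked image and contrast the results."""
    steps, answer = model(image, question)
    masked_steps, masked_answer = model(mask_region(image, box), question)
    return {
        "disrupted_steps": contrast_traces(steps, masked_steps),
        "answer_flipped": answer != masked_answer,
    }


def toy_model(image: Image, question: str) -> Tuple[List[str], str]:
    """Hypothetical stand-in for an MLLM: reacts to any non-zero pixel."""
    if any(px > 0 for row in image for px in row):
        return ["I inspect the image.", "A bright object is visible."], "object"
    return ["I inspect the image."], "nothing"


if __name__ == "__main__":
    img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(crm(img, (1, 1, 2, 2), "What is shown?", toy_model))
```

In practice the stub would be replaced by a call to the model under test, and the string similarity by whatever step-alignment measure the evaluation actually requires; the sketch only shows how a disrupted step and a flipped answer are derived from one masked/unmasked pair.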

Paper Structure

This paper contains 14 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Architecture for Contrastive Region Masking (CRM), which contrasts original and masked chains of thought (CoT) and their final answers to assess the importance of visual evidence in reasoning.
  • Figure 2: Brain_Loading_Tea: (a) Original Image, (b) Masked Image. Question: "What is being poured into the brain in the image?" A case where the Gemini-1.5-Flash, GPT-4o, Qwen-2.5-VL-7b-Instruct, and Llama-3.2-90B-Vision-Instruct models diverge in Answer Flipped and Region Attribution, showing how differently the models respond to masking (Table \ref{tab:results}).
  • Figure 3: Fish_Container: (a) Original Image, (b) Masked Image. Question: "What type of container is shown hanging from the fishing hook?" A case where CoT steps were disrupted across all models, and Gemini-1.5-Flash and Llama-3.2-90B-Vision-Instruct additionally hallucinate after masking (Table \ref{tab:results}).