Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs
Isha Chaturvedi, Anjana Nair, Yushen Li, Adhitya Rajendra Kumar, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma
TL;DR
This work introduces Contrastive Region Masking (CRM), a training-free diagnostic that causally links each step of a multimodal LLM's chain-of-thought to specific visual regions by masking annotated regions and comparing the resulting reasoning traces against unmasked baselines. Evaluated on the VisArgs dataset, CRM uncovers distinct failure modes across state-of-the-art models: some maintain reasoning structure yet hallucinate when evidence is missing, while others ground to visual cues but falter under perturbations. The approach shifts evaluation from final-answer accuracy to the faithfulness and robustness of intermediate reasoning, framing visual benchmarks as diagnostic tools for interpretability and reliability. Overall, CRM provides a practical framework for step-level attribution that supports more interpretable and trustworthy assessment of multimodal reasoning without retraining.
Abstract
We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance but also robustness and fidelity of reasoning.
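The mask-and-contrast loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the image is a tiny 2D grid of pixel values, `toy_model` is a hypothetical stand-in for an MLLM that returns a list of reasoning steps, and the per-step comparison uses plain string similarity from Python's `difflib`, which is a swapped-in placeholder rather than the authors' trace-comparison method.

```python
from difflib import SequenceMatcher


def mask_region(image, box, fill=0):
    """Return a copy of a 2D image (list of rows) with `box` blanked out.

    `box` is (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else px
         for c, px in enumerate(row)]
        for r, row in enumerate(image)
    ]


def contrast_traces(baseline, masked):
    """Per-step shift in [0, 1]: 0 means the step is unchanged by masking.

    String similarity is an illustrative proxy only; a step present in one
    trace but absent from the other counts as fully shifted.
    """
    shifts = []
    for i in range(max(len(baseline), len(masked))):
        if i < len(baseline) and i < len(masked):
            ratio = SequenceMatcher(None, baseline[i], masked[i]).ratio()
            shifts.append(1.0 - ratio)
        else:
            shifts.append(1.0)
    return shifts


def crm(image, regions, model):
    """Mask each annotated region in turn and contrast against the baseline."""
    baseline = model(image)
    return {
        name: contrast_traces(baseline, model(mask_region(image, box)))
        for name, box in regions.items()
    }


def toy_model(image):
    """Hypothetical stand-in for an MLLM: emits two 'reasoning steps'."""
    has_logo = image[0][0] > 0
    has_text = image[1][1] > 0
    return [
        "I see a logo." if has_logo else "I see no logo.",
        "The caption is readable." if has_text else "The caption is missing.",
    ]


image = [[9, 0], [0, 7]]
regions = {"logo": (0, 0, 1, 1), "caption": (1, 1, 2, 2)}  # hypothetical annotations
scores = crm(image, regions, toy_model)
for name, shifts in scores.items():
    print(name, [round(s, 2) for s in shifts])
```

In this toy setup, masking the "logo" region changes only the first step and masking the "caption" region changes only the second, which is the kind of step-to-region attribution the abstract describes. With a real model, the comparison would need to account for paraphrase and sampling variation, which simple string similarity does not.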
