The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
Tejas Anvekar, Fenil Bardoliya, Pavan K. Turaga, Chitta Baral, Vivek Gupta
TL;DR
This work introduces The Perceptual Observatory, a principled framework for evaluating perceptual robustness and vision-language grounding in multimodal LLMs (MLLMs) using controlled perturbations across two domains: face identity and text-in-vision. The framework pairs pixel-based augmentations with diffusion-based stylized illusions and evaluates three core tasks: image matching, grid pointing, and attribute localization. The study reveals that scaling the language component often fails to improve perceptual grounding when the vision encoder is fixed. Key findings show that identity robustness varies with perturbation type, that grounding under out-of-distribution (OOD) perturbations remains challenging, and that thinking-enabled decoding can help in some cases while hindering transfer, especially for faces. The paper provides a reusable dataset and pipeline, offering actionable insights for joint vision-language alignment and for fair, robust MLLM design beyond traditional end-task benchmarks.
Abstract
Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale the language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across two verticals: (i) simple vision tasks, such as face matching and text-in-vision comprehension; and (ii) local-to-global understanding, encompassing image matching, the grid pointing game, and attribute localization, which together test general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing the strengths and weaknesses of current and future models.
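To make the idea of evaluating under controlled perturbations concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a toy grayscale image format, a stand-in classifier, and a simple "clean accuracy minus perturbed accuracy" gap; the names `add_gaussian_noise`, `invert`, `brightness_classifier`, and `robustness_gap` are all hypothetical and do not come from the paper, and the paper's actual augmentations and metrics may differ.

```python
import random
from typing import Callable, List, Sequence

# Assumption: a grayscale image is a list of rows of pixel values in [0, 1].
Image = List[List[float]]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Pixel-based perturbation: add zero-mean Gaussian noise, then clip to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def invert(image: Image) -> Image:
    """Pixel-based perturbation: invert intensities."""
    return [[1.0 - p for p in row] for row in image]


def brightness_classifier(image: Image) -> int:
    """Hypothetical stand-in for a model: predicts 1 if mean brightness exceeds 0.5."""
    pixels = [p for row in image for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)


def accuracy(
    predict: Callable[[Image], int],
    images: Sequence[Image],
    labels: Sequence[int],
) -> float:
    """Fraction of images whose prediction matches the ground-truth label."""
    correct = sum(1 for img, y in zip(images, labels) if predict(img) == y)
    return correct / len(labels)


def robustness_gap(
    predict: Callable[[Image], int],
    images: Sequence[Image],
    labels: Sequence[int],
    perturb: Callable[[Image], Image],
) -> float:
    """Clean accuracy minus accuracy on perturbed copies of the same images."""
    perturbed = [perturb(img) for img in images]
    return accuracy(predict, images, labels) - accuracy(predict, perturbed, labels)


if __name__ == "__main__":
    rng = random.Random(0)
    dark = [[0.1, 0.2], [0.0, 0.1]]
    bright = [[0.9, 0.8], [1.0, 0.9]]
    images, labels = [dark, bright], [0, 1]
    print("gap under inversion:", robustness_gap(brightness_classifier, images, labels, invert))
    print(
        "gap under noise:",
        robustness_gap(
            brightness_classifier, images, labels,
            lambda im: add_gaussian_noise(im, 0.05, rng),
        ),
    )
```

Reporting a gap per perturbation type, rather than a single pooled accuracy, is one way to surface the observation that robustness varies with the kind of perturbation applied.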
