HueManity: Probing Fine-Grained Visual Perception in MLLMs
Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande
TL;DR
HueManity introduces a scalable benchmark of 83,850 Ishihara-style images to probe fine-grained visual perception in Multimodal Large Language Models. It reveals a substantial perceptual grounding gap: the strongest MLLM scores only 33.6% on the numeric task and 3% on the alphanumeric task, far below the human and ResNet-50 baselines. Neither in-context learning nor limited fine-tuning remedies the deficit, which points to architectural bottlenecks rather than task difficulty. The benchmark’s strong alignment with real-world performance on Vision Arena reinforces its practical relevance for developing perceptually robust multimodal systems for safety-critical use. The work also provides open-source generation code to spur further research into perceptually grounded MLLMs.
Abstract
Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety- and reliability-critical settings, perceptual acuity becomes essential. We present HueManity, a scalable automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieves only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans (99.38% and 93.25%, respectively) and a fine-tuned ResNet-50 (96.5% and 94.5%, respectively). These findings expose a critical weakness in MLLMs' perceptual grounding, one that remains obscured by conventional benchmarks emphasizing high-level semantics.
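To make the idea of an Ishihara-style image concrete, the following is a minimal sketch of how a string could be hidden in a field of colored dots. It is not the authors' generation code, which this section does not describe. The bitmap font, the colors, the dot spacing, and the names GLYPHS, text_mask, and make_plate are all assumptions chosen for illustration.

```python
import random

# Hypothetical 3x5 bitmap glyphs for a few digits (illustration only;
# not the font or character set used by the benchmark).
GLYPHS = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "7": ["111", "001", "010", "010", "010"],
}


def text_mask(text, scale=8, gap=1):
    """Return (pixels, width, height), where pixels is the set of (x, y)
    coordinates covered by the rendered text."""
    pixels = set()
    x_off = 0  # horizontal offset in glyph cells
    for ch in text:
        for row, bits in enumerate(GLYPHS[ch]):
            for col, bit in enumerate(bits):
                if bit != "1":
                    continue
                # Expand each glyph cell into a scale x scale pixel square.
                for dy in range(scale):
                    for dx in range(scale):
                        pixels.add(((x_off + col) * scale + dx, row * scale + dy))
        x_off += 3 + gap
    width = (x_off - gap) * scale
    height = 5 * scale
    return pixels, width, height


def make_plate(text, fg=(200, 80, 60), bg=(120, 160, 90), step=4, seed=0):
    """Return (dots, width, height). Each dot is a dict with a center,
    a radius, an RGB color, and a flag saying whether it lies on the text."""
    rng = random.Random(seed)  # seeded so the plate is reproducible
    mask, width, height = text_mask(text)
    dots = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            # Jitter the dot center slightly so the grid is not visible.
            cx = min(width - 1, max(0, x + rng.randint(-1, 1)))
            cy = min(height - 1, max(0, y + rng.randint(-1, 1)))
            in_text = (cx, cy) in mask
            base = fg if in_text else bg
            # Perturb the base color so text and background each span a range.
            color = tuple(min(255, max(0, c + rng.randint(-15, 15))) for c in base)
            dots.append({"x": cx, "y": cy, "r": rng.choice([1, 2]),
                         "color": color, "in_text": in_text})
    return dots, width, height


if __name__ == "__main__":
    dots, width, height = make_plate("27")
    on_text = sum(d["in_text"] for d in dots)
    print(f"{len(dots)} dots on a {width}x{height} canvas, {on_text} on the text")
```

The returned dots could be drawn as filled circles with any imaging library. Because the hidden string is carried only by the color contrast between the two dot populations, reading it requires integrating many small local color cues into a shape, which is the kind of fine-grained pattern recognition the benchmark targets.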
