HueManity: Probing Fine-Grained Visual Perception in MLLMs

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

TL;DR

HueManity introduces a scalable benchmark with 83,850 Ishihara-style images to probe fine-grained visual perception in Multimodal Large Language Models. It reveals a substantial perceptual grounding gap, with top MLLMs scoring only 33.6% on numeric and 3% on alphanumeric tasks, far below human and ResNet-50 baselines. Neither in-context learning nor limited fine-tuning remedies the deficit, pointing to architectural bottlenecks rather than task difficulty. The benchmark’s strong alignment with real-world performance on Vision Arena reinforces its practical relevance for developing perceptually robust, safety-critical multimodal systems. The work also provides open-source generation code to spur further research into perceptually grounded MLLMs.

Abstract

Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety- and reliability-critical settings, perceptual acuity becomes essential. We present HueManity, a scalable automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieved only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans (99.38%, 93.25%) and a fine-tuned ResNet-50 (96.5%, 94.5%). These findings expose a critical weakness in MLLMs' perceptual grounding, one that remains obscured by conventional benchmarks emphasizing high-level semantics.

Paper Structure

This paper contains 26 sections, 7 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: We present HueManity, an automated, scalable benchmark for evaluating fine-grained visual perception in MLLMs. The pipeline (left) embeds characters within challenging Ishihara-style image patterns while ensuring that the generated images remain readable by humans. Experiments (right) reveal that humans and a fine-tuned ResNet-50 baseline significantly outperform the top-5 closed-source and open-source MLLMs, exposing a critical lack of fine-grained understanding.
  • Figure 2: Qualitative examples showing predictions of four representative MLLMs versus baselines on numeric and alphanumeric tasks.
  • Figure 3: Distribution of CIEDE2000 color difference scores for the 25 selected foreground-background color pairs used in the HueManity benchmark.
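
To give a feel for what a color difference score between a foreground and a background color measures, here is a minimal Python sketch. It does not implement CIEDE2000, the metric named in Figure 3. It uses the simpler CIE76 formula (Euclidean distance in CIELAB space) as a stand-in, and the function names and the example color pair are hypothetical, not taken from the paper or its released code.

```python
import math


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB, assuming a D65 white point."""

    def linearize(channel):
        # Undo the sRGB transfer curve to get linear light in [0, 1].
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Cube root with the standard linear segment near zero.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    # Normalize by the D65 reference white before applying f.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e_cie76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance between two colors in CIELAB."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))


# Hypothetical foreground/background pair, chosen only for illustration.
foreground, background = (200, 80, 60), (90, 150, 70)
score = delta_e_cie76(foreground, background)
```

A larger score means the two colors are easier to tell apart. CIEDE2000 refines this distance with lightness, chroma, and hue weighting terms so that the score tracks human perception more closely, which is presumably why the benchmark reports it rather than the plain Euclidean distance.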