OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming

TL;DR

This work introduces OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs, and proposes OddGrid-GRPO, a reinforcement learning framework that combines curriculum learning with a distance-aware reward to significantly enhance a model's fine-grained visual discrimination ability.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their low-level visual perception, particularly their ability to detect fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images in which a single element differs from all others in one or more visual attributes, such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5 and proprietary systems such as Gemini-2.5-Pro and GPT-5, perform far below human level in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and a distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
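
To make the two training ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model answers with a (row, col) grid cell, that an exact hit earns full reward, and that a miss earns partial credit shrinking linearly with Euclidean distance to the odd cell. The function names, the `max_partial` cap, and the linear curriculum schedule are all hypothetical choices made for illustration; the paper's actual reward and schedule may differ.

```python
import math


def distance_aware_reward(pred, target, rows, cols, max_partial=0.5):
    """Hypothetical distance-aware reward over grid cells.

    pred / target are (row, col) tuples; pred may be None if the
    model's answer could not be parsed.
    """
    if pred is None:
        return 0.0  # unparseable answer: no reward
    pr, pc = pred
    if not (0 <= pr < rows and 0 <= pc < cols):
        return 0.0  # answer outside the grid: no reward
    if pred == target:
        return 1.0  # exact localization: full reward
    tr, tc = target
    dist = math.hypot(pr - tr, pc - tc)
    max_dist = math.hypot(rows - 1, cols - 1)  # grid diagonal
    # Assumed shaping: partial credit decays linearly with distance.
    return max_partial * (1.0 - dist / max_dist)


def curriculum_max_level(step, total_steps, num_levels):
    """Hypothetical curriculum: unlock harder levels as training proceeds.

    Returns the highest difficulty level (0 = easiest) that may be
    sampled at the given training step.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return min(num_levels - 1, int(frac * num_levels))


if __name__ == "__main__":
    # A near miss scores higher than a far miss on a 5x5 grid.
    print(distance_aware_reward((2, 2), (2, 3), rows=5, cols=5))
    print(distance_aware_reward((0, 0), (2, 3), rows=5, cols=5))
    print(curriculum_max_level(step=50, total_steps=100, num_levels=4))
```

Under these assumptions a near miss still provides a learning signal, which is one plausible reading of "spatial proximity constraints", while capping partial credit below 1.0 keeps exact localization strictly preferred.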

Paper Structure (37 sections, 3 equations, 20 figures, 15 tables)

This paper contains 37 sections, 3 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Illustration of human perceptual visual discrepancy sensitivity, showing the transition from imperceptible to perceptible visual differences in color, rotation, and size.
  • Figure 2: Overview of OddGridBench. OddGridBench encompasses four primary visual attributes, including color, size, rotation, and position, and supports both single-attribute and multi-attribute discrepancy compositions, providing a systematic framework for evaluating the perceptual discrepancy sensitivity of MLLMs.
  • Figure 3: Evaluation results of MLLMs on OddGridBench. Human performance significantly surpasses that of all evaluated MLLMs across the color, size, rotation, and position dimensions, as well as on multi-attribute combinations.
  • Figure 4: Overview of the OddGridBench data generation pipeline, which constructs grid-based images from collected icons under precisely controlled perceptual conditions for evaluating visual discrepancy sensitivity.
  • Figure 5: Overview of OddGrid-GRPO framework. OddGrid-GRPO integrates curriculum-guided optimization with spatially guided reward shaping to enhance perceptual grounding and improve fine-grained visual discrimination in MLLMs.
  • ...and 15 more figures
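
The grid construction summarized in the captions of Figures 2 and 4 can be illustrated with a small sketch. It builds only an abstract specification of one sample (which cell is odd and which attributes are perturbed), not a rendered image. The function name `make_grid_spec`, the single shared `magnitude` value, and the dictionary layout are hypothetical and are not taken from the released code.

```python
import random

# The four attributes named in the paper.
ATTRIBUTES = ("color", "size", "rotation", "position")


def make_grid_spec(rows, cols, odd_attributes, magnitude, seed=None):
    """Build a hypothetical spec for one odd-one-out grid sample.

    Every cell carries a per-attribute delta; all deltas are zero
    except in the single odd cell, where each attribute listed in
    odd_attributes is perturbed by `magnitude`.
    """
    unknown = set(odd_attributes) - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")

    rng = random.Random(seed)  # local RNG so samples are reproducible
    odd_cell = (rng.randrange(rows), rng.randrange(cols))

    cells = []
    for r in range(rows):
        for c in range(cols):
            delta = {name: 0.0 for name in ATTRIBUTES}
            if (r, c) == odd_cell:
                for name in odd_attributes:
                    delta[name] = magnitude
            cells.append({"row": r, "col": c, "delta": delta})

    return {"rows": rows, "cols": cols, "odd_cell": odd_cell, "cells": cells}


if __name__ == "__main__":
    # A multi-attribute sample: the odd icon differs in color and size.
    spec = make_grid_spec(4, 5, ["color", "size"], magnitude=0.2, seed=0)
    print(spec["odd_cell"])
```

Passing one attribute yields a single-attribute sample and passing several yields a multi-attribute composition. In this sketch, a smaller `magnitude` would correspond to a harder, less perceptible discrepancy.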