GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

TL;DR

This work introduces GroundingME, a challenging, multi-dimensional visual grounding benchmark designed to probe whether Multimodal Large Language Models truly ground language in vision rather than exploiting dataset shortcuts. GroundingME combines four challenge dimensions—Discriminative, Spatial, Limited, and Rejection—across 1,005 samples created via a three-stage pipeline (bounding-box annotation, description generation, manual refinement). Evaluations across 25 MLLMs reveal substantial gaps, with the top model at 45.1% accuracy and widespread failure on rejection tasks, underscoring safety concerns for real-world use. The authors propose two practical improvements—test-time thinking with trajectory selection and data-mixture training to foster rejection—demonstrating measurable gains and offering a concrete roadmap toward more reliable, human-like visual grounding.

Abstract

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvement: (1) test-time scaling, which selects the best response based on its thinking trajectory and improves complex grounding by up to 2.9%, and (2) data-mixture training, which teaches models to recognize ungroundable queries and boosts rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
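The abstract reports accuracy over both groundable and ungroundable (Rejection) queries but does not spell out the scoring rule. The following is a minimal sketch of how such scoring could work, assuming the common visual grounding convention that a prediction counts as correct when its IoU with the ground-truth box is at least 0.5, and assuming a rejection is represented as `None`. The threshold, the `None` encoding, and the function names are illustrative assumptions, not the benchmark's official evaluation code.

```python
from typing import Optional, Sequence

# A box is (x1, y1, x2, y2) in pixel coordinates; None means "no such object".
Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: Optional[Box], gt: Optional[Box], thresh: float = 0.5) -> bool:
    """Score one sample. thresh=0.5 is an assumed convention, not from the paper."""
    if gt is None:
        # Rejection sample: only an explicit "no object" answer is correct.
        return pred is None
    if pred is None:
        # The model rejected a query that is actually groundable.
        return False
    return iou(pred, gt) >= thresh


def accuracy(preds: Sequence[Optional[Box]], gts: Sequence[Optional[Box]],
             thresh: float = 0.5) -> float:
    """Fraction of samples scored correct; 0.0 for an empty benchmark."""
    if not gts:
        return 0.0
    hits = sum(is_correct(p, g, thresh) for p, g in zip(preds, gts))
    return hits / len(gts)
```

Under this rule, a model that always outputs some box scores 0% on rejection samples regardless of how well it localizes elsewhere, which matches the failure mode the abstract describes.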

Paper Structure

This paper contains 32 sections, 6 figures, 25 tables.

Figures (6)

  • Figure 1: Examples of different visual grounding benchmarks. Prior benchmarks (Top) are either too simple or prone to shortcuts. Our proposed GroundingME (Bottom) increases the challenge in four important dimensions. The green bounding box indicates the correct ground-truth object, while the red bounding box shows the answer of Qwen3-VL-30B-A3B-Instruct.
  • Figure 2: The overall data construction pipeline of GroundingME. The process consists of three main stages: (1) Bounding Box Annotation, which utilizes a semi-automated pipeline with RAM++ and GroundingDINO for bounding box generation; (2) Description Generation, which leverages Gemini-2.5-Flash for generating initial referring expressions; and (3) Manual Selection and Refinement, where human annotators apply rigorous filtering and refinement according to our challenge taxonomy.
  • Figure 3: Subtask Distribution of GroundingME. Our benchmark comprises 1,005 samples, distributed across four L-1 categories and twelve L-2 subcategories.
  • Figure 4: Case study of two different thinking trajectories of Qwen3-VL-235B-A22B-Thinking for the same description. The correct answer is to reject the query; the red bounding box shows the distractor. The correct trajectory (Green) demonstrates rigorous adherence to the description, systematically identifying all attribute mismatches (e.g., short- vs. long-sleeve, blue vs. black pants) and correctly concluding with a null output. In contrast, the erroneous trajectory (Red) acknowledges the same discrepancies but compromises by speculating that the description may be in error, ultimately leading to an incorrect bounding box prediction.
  • Figure 5: Performance gain of different models by enabling thinking mode. Subtask results are provided in the appendix.
  • ...and 1 more figure