Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu

TL;DR

This work introduces Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target, and conducts comprehensive ablations to show that solving Ref-Adv requires reasoning beyond simple cues.

Abstract

Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
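
The two ablations named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so everything below is an assumption: the function names `shuffle_words` and `delete_descriptor` are hypothetical, word order is perturbed by a seeded uniform shuffle, and a descriptor is deleted by plain substring removal. The idea is that if a model's accuracy barely changes under these edits, the original expression was probably solvable from shortcut cues.

```python
import random


def shuffle_words(expression: str, seed: int = 0) -> str:
    """Return the expression with its words in a random order.

    Assumption: a uniform shuffle stands in for the paper's word order
    perturbation, whose exact form is not given in this section.
    """
    words = expression.split()
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


def delete_descriptor(expression: str, descriptor: str) -> str:
    """Remove one descriptor phrase and normalize the leftover whitespace.

    Assumption: descriptors are contiguous substrings of the expression.
    """
    remaining = expression.replace(descriptor, " ")
    return " ".join(remaining.split())


if __name__ == "__main__":
    # Hypothetical expression, not taken from the dataset.
    expr = "the man not wearing a hat next to the red car"
    print(shuffle_words(expr))
    print(delete_descriptor(expr, "not wearing a hat"))
```

Comparing grounding accuracy on the original expression against the perturbed and shortened variants then gives a rough measure of how much each descriptor and the word order actually matter.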

Paper Structure (23 sections, 6 figures, 7 tables)

This paper contains 23 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Common limitations of classic referring expression benchmarks that reduce the reasoning challenge. These include very short expressions, few visual distractors, and overspecified descriptors that enable shortcut matching without requiring genuine reasoning. The cyan box highlights the ground truth region.
  • Figure 2: Accuracy@0.5 (IoU $\geq$ 0.5) of Qwen on the RefCOCO/+/g validation sets. Marker size is proportional to the number of samples in each bin. (a) Acc@0.5 as a function of the number of words in the expression. (b) Acc@0.5 as a function of distractor count. Most samples have short expressions and few distractors.
  • Figure 3: LLM-authored data curation pipeline for Ref-Adv. (a) Prepare Image: filter images, ensure $\geq$ 3 distractors, and add number tags to candidate instances. (b) Similarity Judgement: use GPT-4o to identify the most similar pair and elicit group-level and instance-level discriminators. (c) Expression Generation: compose minimally sufficient referring expressions using discriminators and optional negation. (d) Human Verification: verify expression accuracy and confirm the existence of hard distractors before inclusion.
  • Figure 4: Dataset statistics across REC benchmarks. (a) Expression length comparison. (b) Distribution of distractor counts. (c) Instance size on a log area scale.
  • Figure 5: Performance of representative multimodal LLMs on Ref-Adv. We include qualitative examples with and without CoT for Gemini 2.5-Flash and Qwen2.5-VL-72B. CoT answers are shown in a gray box. Hard distractors in Ref-Adv challenge current MLLMs.
  • ...and 1 more figures
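
The Accuracy@0.5 metric in the Figure 2 caption counts a prediction as correct when its box overlaps the ground truth with IoU $\geq$ 0.5. A minimal sketch of that computation follows; the `(x1, y1, x2, y2)` corner format and the names `iou` and `accuracy_at_iou` are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    # Toy boxes: the first prediction matches exactly, the second has IoU 1/3.
    preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
    gts = [(0, 0, 2, 2), (1, 0, 3, 2)]
    print(accuracy_at_iou(preds, gts))
```

Grouping samples by expression length or distractor count before calling `accuracy_at_iou` would give per-bin curves of the kind the caption describes.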