Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez

Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.

Paper Structure

This paper contains 63 sections, 13 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Referring Scenario Comprehension (RSC) vs. traditional referring expression comprehension (REC). Each row shows the same target object under both paradigms. Traditional REC queries often name the target category directly, allowing success via lexical matching. RSC instead pairs each image with a lengthy scenario-based query specifying a user role, a goal, and multiple disambiguating cues, including explicit contrasts against competing objects, and requires an output identifying both the target object and its bounding box. The RSC difficulty tags (U/C/S/O/P: Uniqueness, Clutter, Size, Overlap, Position) characterize each instance, enabling fine-grained training and evaluation; a sketch of one such instance follows this figure list.
  • Figure 2: Phase 1 filters and balances source instances by computing five interpretable difficulty tags to form a tag-balanced candidate pool (a stratified-sampling sketch follows this figure list). Phase 2 generates annotations via a two-stage process: a small-scale refinement loop first validates the generation prompt through iterative system refinement and human audit before large-scale scenario generation is applied to the full candidate pool. Each instance is annotated with a target object description, a scenario query with reasoning traces, acceptable category aliases, and an LLM-predicted bounding box for quality filtering. Phase 3 applies automatic and human quality control: a quality judge filters rough annotations, and human auditors verify a stratified sample. The final RSC dataset provides, per instance, a scenario query, reasoning traces, acceptable names, a ground-truth box, and difficulty tags.
  • Figure 3: ScenGround prompt and output schema. Given an image and a user-driven scenario, the model is instructed to reason inside <think> and to emit structured JSON inside <answer>. The JSON contains target_object and bbox in [x, y, w, h] format. Scenarios avoid category names and force disambiguation via attributes, relations, and spatial cues, while <think> must justify the selection and ignore distractors. This schema is used in both TP-SFT and IC-GRPO; to increase robustness, IC-GRPO uses seven additional prompt templates. A minimal parser for this output format is sketched after the figure list.
  • Figure 4: Qualitative results on RSC. Red boxes denote ground truth; blue boxes denote ScenGround predictions.
  • Figure 5: Query length distribution. RSC queries are substantially longer than RefCOCO+ and RefCOCOg expressions, reflecting the paragraph-length scenario descriptions that specify user roles, goals, and disambiguating cues. RefCOCO+ and RefCOCOg peak below 10 words; RSC peaks around 50–60 words.
  • ...and 13 more figures
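
As referenced in the Figure 1 caption, the per-instance structure RSC exposes (scenario query, reasoning trace, acceptable names, ground-truth box, and U/C/S/O/P difficulty tags) can be summarized as a small data container. This is a minimal sketch with hypothetical field names; the released dataset may use a different schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class RSCInstance:
    """One RSC example; field names are illustrative, not the official schema."""
    image_path: str
    scenario_query: str                 # paragraph-length role/goal/context description
    reasoning_trace: str                # annotated reasoning leading to the target
    target_name: str                    # canonical target object description
    acceptable_aliases: List[str]       # category aliases accepted as correct
    bbox_xywh: Tuple[float, float, float, float]   # ground-truth box as [x, y, w, h]
    difficulty_tags: Dict[str, bool] = field(default_factory=dict)  # keys: "U", "C", "S", "O", "P"
```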
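
The stratified-sampling sketch referenced in the Figure 2 caption: Phase 1's tag balancing could be approximated by capping how many candidates share the same combination of active difficulty tags. This is an assumption about one way such balancing might be implemented, not the paper's actual procedure; it reuses the hypothetical RSCInstance above.

```python
import random
from collections import defaultdict
from typing import List

def balance_by_tags(candidates: List[RSCInstance],
                    per_profile_cap: int,
                    seed: int = 0) -> List[RSCInstance]:
    """Cap the number of candidates per difficulty-tag profile so that no single
    combination of U/C/S/O/P dominates the pool (illustrative only)."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for inst in candidates:
        # Profile = the sorted tuple of active tags, e.g. ("C", "U")
        profile = tuple(sorted(tag for tag, on in inst.difficulty_tags.items() if on))
        buckets[profile].append(inst)
    pool: List[RSCInstance] = []
    for insts in buckets.values():
        rng.shuffle(insts)
        pool.extend(insts[:per_profile_cap])
    rng.shuffle(pool)
    return pool
```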
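
The output parser referenced in the Figure 3 caption: given the <think>/<answer> schema with a JSON object containing target_object and bbox in [x, y, w, h], a model response can be parsed as below. The tag names and JSON keys come from the caption; the regex-based extraction and error handling are assumptions about one reasonable implementation.

```python
import json
import re
from typing import Optional, Tuple

ANSWER_RE = re.compile(r"<answer>\s*(\{.*?\})\s*</answer>", re.DOTALL)

def parse_scenground_output(text: str) -> Optional[Tuple[str, Tuple[float, float, float, float]]]:
    """Return (target_object, (x, y, w, h)) from a response following the
    <think>...</think><answer>{...}</answer> schema, or None if it is malformed."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
        x, y, w, h = payload["bbox"]
        return str(payload["target_object"]), (float(x), float(y), float(w), float(h))
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        return None
```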