Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection

Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao

Abstract

Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. In such regimes, end-to-end grounding detectors must learn spatial and semantic structure from scratch, wasting precious samples. We ask a simple question: can explicit reasoning priors help models learn more efficiently when data is scarce? To explore this, we first introduce the Data-efficient Referring Object Detection (De-ROD) task, a benchmark protocol for measuring ROD performance in low-data and few-shot settings. We then propose HeROD (Heuristic-inspired ROD), a lightweight, model-agnostic framework that injects explicit, heuristic-inspired spatial and semantic reasoning priors, which are interpretable signals derived from the referring phrase, into three stages of a modern DETR-style pipeline: proposal ranking, prediction fusion, and Hungarian matching. By biasing both training and inference toward plausible candidates, these priors promise to improve label efficiency and convergence performance. On RefCOCO, RefCOCO+, and RefCOCOg, HeROD consistently outperforms strong grounding baselines in scarce-label regimes. More broadly, our results suggest that integrating simple, interpretable reasoning priors provides a practical and extensible path toward more data-efficient vision-language understanding.
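As a rough illustration of the "proposal ranking" and "prediction fusion" idea described in the abstract, the sketch below blends a detector's confidence with a prior score and re-ranks candidates. The convex combination, the weight `alpha`, and the function names `fuse_scores` and `rank_candidates` are assumptions made for illustration; the abstract does not specify how HeROD actually fuses these signals.

```python
def fuse_scores(det_scores, prior_scores, alpha=0.5):
    """Blend detector confidences with prior scores; both are assumed to lie in [0, 1].

    Hypothetical fusion rule: a convex combination weighted by alpha.
    """
    if len(det_scores) != len(prior_scores):
        raise ValueError("det_scores and prior_scores must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * d + alpha * p for d, p in zip(det_scores, prior_scores)]


def rank_candidates(det_scores, prior_scores, alpha=0.5):
    """Return candidate indices sorted from most to least plausible after fusion."""
    fused = fuse_scores(det_scores, prior_scores, alpha)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


# Toy example: the detector slightly prefers candidate 0, but the phrase
# (e.g. "... on the left") gives candidate 1 a much higher prior score.
det_scores = [0.6, 0.5]
prior_scores = [0.1, 0.9]
print(rank_candidates(det_scores, prior_scores, alpha=0.5))  # prior-aware ranking
print(rank_candidates(det_scores, prior_scores, alpha=0.0))  # detector-only ranking
```

With `alpha=0.5` the prior moves candidate 1 to the top, while `alpha=0.0` recovers the detector-only ordering. A similar bias could in principle be added to a Hungarian matching cost, but that is likewise an assumption and not something stated here.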

Paper Structure (28 sections, 10 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Conceptual illustration. (a) Without priors, grounding detectors trained with limited labels often ignore spatial/semantic cues (e.g., “on the left”). (b) HeROD injects heuristic-inspired reasoning priors (spatial + semantic), guiding the detector toward correct localization and improving data-efficient learning.
  • Figure 2: HeROD pipeline. Text and image features are encoded, and spatial/semantic reasoning priors are derived from the phrase and image. These priors are injected into reference generation, final prediction, and training loss, guiding the detector toward plausible regions for more data-efficient learning.
  • Figure 3: Visualization of the four Spatial Heuristic-inspired Scoring Maps, shown in reading order (left to right, top to bottom): left-related, top-related, right-related, and bottom-related. Brighter colors indicate higher values; darker colors indicate lower values.
  • Figure 4: Performance comparison between HeROD-U and the UNINEXT baseline across varying data volumes.
  • Figure 5: Performance comparison between HeROD-G and the Grounding DINO baseline across varying data volumes.
  • ...and 2 more figures
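
To make the spatial scoring maps of Figure 3 more concrete, here is a minimal sketch that builds one map per direction and scores a box by its center. The linear ramp, the normalization to [0, 1], and the names `spatial_prior_map` and `box_prior` are assumptions chosen for illustration; the maps actually used by HeROD may be shaped differently.

```python
DIRECTIONS = ("left", "right", "top", "bottom")


def _ramp(direction, u, v):
    """Score a normalized position (u, v) in [0, 1]^2; higher means more consistent."""
    if direction == "left":
        return 1.0 - u
    if direction == "right":
        return u
    if direction == "top":
        return 1.0 - v
    if direction == "bottom":
        return v
    raise ValueError(f"unknown direction: {direction!r}")


def spatial_prior_map(direction, height, width):
    """Return a height x width grid (list of rows) of scores in [0, 1]."""
    grid = []
    for y in range(height):
        v = y / (height - 1) if height > 1 else 0.5
        row = []
        for x in range(width):
            u = x / (width - 1) if width > 1 else 0.5
            row.append(_ramp(direction, u, v))
        grid.append(row)
    return grid


def box_prior(direction, box, img_w, img_h):
    """Score a box (x1, y1, x2, y2) in pixels by the position of its center."""
    x1, y1, x2, y2 = box
    u = ((x1 + x2) / 2.0) / img_w
    v = ((y1 + y2) / 2.0) / img_h
    return _ramp(direction, u, v)


# A box near the left edge scores high for "left" and low for "right".
left_box = (0, 40, 20, 60)
print(box_prior("left", left_box, img_w=100, img_h=100))
print(box_prior("right", left_box, img_w=100, img_h=100))
```

Scores like these could serve as the prior term when ranking candidate boxes for a phrase that contains a directional word.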