Referring Expression Comprehension for Small Objects

Kanoko Goto, Takumi Hirose, Mahiro Ukai, Shuhei Kurita, Nakamasa Inoue

TL;DR

This work tackles the difficulty of referring expression comprehension for extremely small objects in autonomous driving. It introduces SOREC, a dataset of 100,000 referring-expression–bounding-box pairs for tiny road objects, and PIZA, a progressive-iterative zooming adapter for parameter-efficient fine-tuning that enables autoregressive zooming to localize targets. Experiments show that applying PIZA to GroundingDINO yields significant accuracy gains with far fewer trainable parameters, outperforming baselines and ablations across multiple settings. The dataset and method together advance small-object REC and offer practical implications for safe, precise object localization in real-world driving scenarios.

Abstract

Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, code, and pre-trained models are publicly available on the project page.

Paper Structure

This paper contains 17 sections, 6 equations, 18 figures, 14 tables.

Figures (18)

  • Figure 1: The SOREC dataset consists of pairs of referring expressions and bounding boxes for extremely small objects. (a) Existing approach fine-tunes a model $F$ to localize the target. (b) Our approach fine-tunes $F$ to progressively zoom in and localize the target in an autoregressive manner. (c) Example of prediction in three zooming steps.
  • Figure 2: Dataset comparison. (a) RefCOCO is a representative REC dataset consisting of expressions and bounding boxes for normal-sized objects. (b) SOREC is our dataset, which uses longer expressions than RefCOCO to identify small objects. (c-e) Comparison of word count, image size, and relative bounding box size distributions on the test sets.
  • Figure 3: Word clouds for RefCOCO (left) and SOREC (right).
  • Figure 4: Fine-tuning with PIZA. Given a pre-trained model $F$, PIZA produces, through fine-tuning, a model $F_{\circledast}$ that zooms in to localize small objects in an autoregressive manner. In the inference phase, bounding boxes $\bm{b}_{0}, \bm{b}_{1}, \ldots, \bm{b}_{T}$ indicating zooming steps are predicted to localize the target at the end.
  • Figure 5: PIZA module.
  • ...and 13 more figures
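
The autoregressive zooming described in the captions of Figures 1 and 4 can be illustrated with a short sketch. This is not the authors' implementation: the `predict` callable is a hypothetical stand-in for the fine-tuned model, and the sketch assumes that each step returns a box in coordinates normalized to the current view, that this box becomes the next view, and that the number of steps $T$ is fixed in advance. The names `to_global`, `zoom_localize`, and `toy_predict` are invented for illustration only.

```python
from typing import Callable, List, Tuple

# A box is (x0, y0, x1, y1). "Global" boxes are in full-image pixels;
# "local" boxes are normalized to [0, 1] within the current view.
Box = Tuple[float, float, float, float]


def to_global(local_box: Box, view: Box) -> Box:
    """Map a box normalized to `view` back to full-image pixel coordinates."""
    vx0, vy0, vx1, vy1 = view
    width, height = vx1 - vx0, vy1 - vy0
    lx0, ly0, lx1, ly1 = local_box
    return (vx0 + lx0 * width, vy0 + ly0 * height,
            vx0 + lx1 * width, vy0 + ly1 * height)


def zoom_localize(
    predict: Callable[[Box, str, int], Box],  # hypothetical model interface
    image_size: Tuple[int, int],
    expression: str,
    num_steps: int,
) -> List[Box]:
    """Return global boxes b_0, ..., b_T; the last one is the final prediction."""
    img_w, img_h = image_size
    view: Box = (0.0, 0.0, float(img_w), float(img_h))  # start from the full image
    boxes: List[Box] = []
    for t in range(num_steps + 1):
        local = predict(view, expression, t)  # box relative to the current view
        view = to_global(local, view)         # assumption: zoom into the predicted box
        boxes.append(view)
    return boxes


def toy_predict(view: Box, expression: str, step: int) -> Box:
    """Toy predictor that always selects the central half of the current view."""
    return (0.25, 0.25, 0.75, 0.75)


if __name__ == "__main__":
    for t, box in enumerate(zoom_localize(toy_predict, (1600, 800), "the small red sign", 2)):
        print(f"b_{t} = {box}")
```

A real system would crop and resize the image region given by `view` before calling the model; here the predictor receives only the view box so that the coordinate bookkeeping stays visible. Each step shrinks the view, so the target occupies a larger fraction of the model input at every iteration.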