Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection

Jiancheng Pan, Yanxing Liu, Xiao He, Long Peng, Jiahao Li, Yuze Sun, Xiaomeng Huang

TL;DR

This work tackles cross-domain few-shot object detection by leveraging foundation models and introducing Enhance Then Search (ETS), a framework that jointly optimizes mixed image augmentation policies and target-domain sub-space configurations. By constructing a coarse validation set and tuning hyperparameters with a grid search, ETS narrows domain gaps with minimal annotation overhead. Empirical results across six public CD-FSOD benchmarks and several unseen datasets show substantial improvements over baselines, demonstrating robust cross-domain generalization and practical deployment potential for vision-language grounding models in data-scarce environments. Overall, the paper highlights the value of jointly optimizing augmentation strategies and domain subspaces to maximize CD-FSOD performance, offering a scalable approach for adapting powerful models with limited labeled data.

Abstract

Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably well in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that combining image-based data augmentation techniques with a grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This approach enables efficient few-shot object detection and addresses the CD-FSOD problem by efficiently searching for the optimal parameter configuration of the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at https://github.com/jaychempan/ETS.
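
The grid-based search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names (`mixup_prob`, `cutmix_prob`, `color_jitter_prob`), the candidate values, and the `toy_evaluate` scoring function are all hypothetical stand-ins. In a real setup, the evaluation step would presumably fine-tune the detector with the given augmentation configuration and return a detection metric measured on the coarse validation set.

```python
from itertools import product
from typing import Callable, Dict, Tuple

# Hypothetical search space: application probabilities for a few augmentations.
# These names and values are illustrative assumptions, not taken from the paper.
SEARCH_SPACE: Dict[str, Tuple[float, ...]] = {
    "mixup_prob": (0.0, 0.3, 0.6),
    "cutmix_prob": (0.0, 0.3, 0.6),
    "color_jitter_prob": (0.0, 0.5),
}


def grid_search(
    space: Dict[str, Tuple[float, ...]],
    evaluate: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Exhaustively score every configuration and return the best one."""
    names = list(space)
    best_cfg: Dict[str, float] = {}
    best_score = float("-inf")
    for values in product(*(space[name] for name in names)):
        cfg = dict(zip(names, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_evaluate(cfg: Dict[str, float]) -> float:
    """Stand-in for 'fine-tune, then score on a coarse validation set'.

    Returns the negative squared distance to an arbitrary target
    configuration, so the search has a known optimum.
    """
    target = {"mixup_prob": 0.3, "cutmix_prob": 0.6, "color_jitter_prob": 0.5}
    return -sum((cfg[key] - target[key]) ** 2 for key in cfg)


best_cfg, best_score = grid_search(SEARCH_SPACE, toy_evaluate)
print(best_cfg, best_score)
```

Exhaustive enumeration costs one evaluation per grid point (18 here), which stays affordable only while each evaluation is a short few-shot fine-tuning run and the grid remains small.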

Paper Structure

This paper contains 16 sections, 2 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic of proposed augmentation-search strategy for cross-domain few-shot object detection.
  • Figure 2: Overall framework of augmentation-search strategy for CD-FSOD, which seamlessly integrates dynamic mixed image augmentation with efficient exploration of domain subspaces.
  • Figure 3: Different image augmentation methods.
  • Figure 4: Grid search strategy searches for the optimal mixed image augmentation in the parameter space.
  • Figure 5: Search strategy experiments for 10-shot detection results on ArTaxOr dataset.
  • ...and 1 more figure