Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang

TL;DR

This paper addresses cross-domain few-shot object detection (CD-FSOD) by introducing Domain-RAG, a training-free, retrieval-guided compositional image generation framework that fixes the foreground and adapts the background. It uses a three-stage pipeline (domain-aware background retrieval, domain-guided background generation, and foreground-background composition) to synthesize domain-aligned training samples without additional supervision or training of the generation pipeline; the synthesized images are then used to fine-tune the detector. The method leverages retrieval priors from COCO together with the Redux and Flux diffusion tools to produce domain-consistent backgrounds and composites them with the preserved foregrounds. Domain-RAG reports state-of-the-art results on CD-FSOD, remote sensing FSOD (RS-FSOD), and camouflaged FSOD, with consistent improvements over strong baselines in low-shot, cross-domain settings, and it serves as a plug-and-play augmentation for detection models.

Abstract

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Code will be released upon acceptance.
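
The three stages described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`retrieve_top_k`, `generate_background_stub`, `compose`), the use of cosine similarity over precomputed embeddings, and the pixel-averaging stand-in for the diffusion-based generator are all assumptions made for illustration only.

```python
import numpy as np


def retrieve_top_k(query_vec, gallery_vecs, k=2):
    """Stage 1 (assumed form): rank gallery embeddings by cosine similarity
    to the query embedding and return the indices of the k best matches."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:k]


def generate_background_stub(original, retrieved):
    """Stage 2 placeholder: the paper conditions a generative model on the
    original and retrieved images. Here we only average pixels so that the
    sketch runs without a diffusion model (hypothetical stand-in)."""
    stack = np.concatenate([original[None], retrieved], axis=0)
    return stack.mean(axis=0).astype(original.dtype)


def compose(foreground, background, mask):
    """Stage 3: keep foreground pixels where mask is nonzero and take the
    new background everywhere else ("fix the foreground, adapt the background")."""
    keep = mask[..., None].astype(bool)
    return np.where(keep, foreground, background)


def domain_rag_sketch(image, mask, query_vec, gallery_vecs, gallery_images, k=2):
    """Chain the three stages on one image of shape (H, W, 3)."""
    idx = retrieve_top_k(query_vec, gallery_vecs, k)
    new_background = generate_background_stub(image, gallery_images[idx])
    return compose(image, new_background, mask)
```

Because the foreground pixels are copied unchanged, the original bounding-box annotation of the object would remain valid for the composed image, which is what makes such samples usable for detector fine-tuning.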

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Given images from distinct novel domains, we compare generation results of baseline methods (a–c) and our approach (d), and illustrate the main pipeline of our Domain-RAG (e).
  • Figure 2: Illustration of our Domain-RAG. Built on our principle of "fix the foreground, adapt the background", we first decompose the image and then process it with three key modules: domain-aware background retrieval, domain-guided background generation, and foreground-background composition.
  • Figure 3: (a) Ablation study on modules, results reported on NEU-DET, 1 shot. (b) Visualization of target image (top row) and generated image (second row). (c) Failure Cases.
  • Figure 4: Visual comparison between Domain-RAG and other augmentation methods.
  • Figure 5: Visualization of our data generation pipeline. From left to right: (1) the target query image, (2) retrieved backgrounds, (3) generated new backgrounds, and (4) the final synthesized image used for further model finetuning.
  • ...and 1 more figure