
A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps

Xuanlong Yu, Youyang Sha, Longfei Liu, Xi Shen, Di Yang

Abstract

Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. Inspired by ensemble learning, the decoder comprises a shared hierarchical layer followed by multiple parallel decoder branches, where each branch employs denoising queries either inherited from the shared layer or newly initialized to encourage prediction diversity. This design fully exploits pretrained weights without introducing additional parameters, and the resulting diverse predictions can be effectively ensembled to improve generalization. We further leverage a unified progressive fine-tuning framework with a plateau-aware learning rate schedule, which stabilizes optimization and achieves strong few-shot adaptation without complex data augmentations or extensive hyperparameter tuning. Extensive experiments on CD-FSOD, ODinW-13, and RF100-VL validate the effectiveness of our approach. Notably, on RF100-VL, which includes 100 datasets across diverse domains, our method achieves an average performance of 41.9 in the 10-shot setting, significantly outperforming the recent approach SAM3, which obtains 35.7. We further construct a mixed-domain test set from CD-FSOD to evaluate robustness to out-of-distribution (OOD) samples, showing that our proposed modules yield clear gains. These results highlight the effectiveness, generalization, and robustness of the proposed method. Code is available at: https://github.com/Intellindust-AI-Lab/FT-FSOD.


Paper Structure

This paper contains 39 sections, 10 equations, 3 figures, and 19 tables.

Figures (3)

  • Figure 1: Performance on large-scale cross-domain FSOD benchmarks. All methods are adapted from pretrained models. Our approach, based on open-source MMGroundingDINO (MMGDINO) zhao2024open, surpasses prior SOTA (Domain-RAG domainrag, MQ-GLIP-L mqdet2023 and MMGDINO-L zhao2024open) on CD-FSOD fu2024cross, ODinW-13 zhang2022glipv2 and RF100-VL robicheaux2025roboflow100vl benchmarks, and achieves comparable results to fine-tuned SAM3 sam3_2025, notably outperforming it on the largest RF100-VL benchmark.
  • Figure 2: Overview of the proposed hybrid ensemble decoder. Our method extends the original decoder of a transformer-based object detector by parallelizing the topmost decoder layers for the FSOD task. The final detection result is aggregated from the object query outputs of all decoder branches. During training, we also randomly replace the original denoising queries with newly initialized ones to further introduce diversity into the inputs of the parallelized decoder layers.
  • Figure 3: Illustration of the performance reduction of different fine-tuning strategies when the test set contains OOD samples. The performance reductions are highlighted in bold red, and the most robust results are underlined.
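The decoder design described in the abstract and Figure 2 — a shared layer feeding multiple parallel branches, with denoising queries randomly re-initialized during training and the branch predictions ensembled — can be illustrated with a minimal sketch. This is not the authors' implementation: the linear-plus-ReLU "layers", the averaging aggregation, and the `replace_p` probability are simplifying assumptions standing in for real transformer decoder layers and the paper's aggregation scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_layer(queries, weight):
    # Stand-in for one transformer decoder layer (assumption:
    # a linear map + ReLU replaces attention/FFN for illustration).
    return np.maximum(queries @ weight, 0.0)

def hybrid_ensemble_decode(queries, shared_w, branch_ws,
                           train=False, replace_p=0.5):
    # Shared hierarchical layer, applied once to all queries.
    shared_out = decoder_layer(queries, shared_w)

    branch_outputs = []
    for w in branch_ws:  # parallel decoder branches
        branch_in = shared_out
        if train and rng.random() < replace_p:
            # Randomly swap the inherited (denoising) queries for newly
            # initialized ones to encourage diversity across branches.
            branch_in = rng.standard_normal(shared_out.shape)
        branch_outputs.append(decoder_layer(branch_in, w))

    # Ensemble: aggregate the diverse branch predictions
    # (averaging used here as one plausible aggregation).
    return np.mean(branch_outputs, axis=0)

# Hypothetical shapes: 100 object queries of dimension 256, 3 branches.
queries = rng.standard_normal((100, 256))
shared_w = rng.standard_normal((256, 256)) * 0.05
branch_ws = [rng.standard_normal((256, 256)) * 0.05 for _ in range(3)]

preds = hybrid_ensemble_decode(queries, shared_w, branch_ws, train=True)
print(preds.shape)  # (100, 256)
```

Note that because every branch reuses the pretrained shared layer and the branch weights are copies of existing decoder layers in the actual method, this parallelization adds prediction diversity without adding parameters, which is the property the abstract emphasizes.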