Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen

TL;DR

Cross-domain few-shot object detection (CD-FSOD) must cope with both domain shift and scarce target-domain labels. The authors propose Learning Multi-modal Prototypes (LMP), a dual-branch detector that fuses open-vocabulary text guidance with target-domain visual prototypes. Its Visual Prototype Construction module builds class prototypes from support RoIs and generates hard-negative prototypes by jittering ground-truth boxes. Both branches are trained jointly and ensembled at inference. LMP achieves state-of-the-art or competitive mAP on six CD-FSOD datasets across 1/5/10-shot settings, with notable gains in the data-scarce 1-shot regime. By grounding detection in both semantic representations and domain-specific visual cues, LMP improves localization under domain shift while retaining open-vocabulary capabilities.

Abstract

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.
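
The abstract describes aggregating class-level prototypes from support RoIs. As a rough illustration only, the sketch below averages L2-normalized RoI feature vectors per class and scores a query feature by cosine similarity. The aggregation rule (mean pooling), the similarity measure, the function names, and the toy class names and 2-D features are all assumptions made for this example; the excerpt does not specify how LMP actually aggregates or matches prototypes.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_class_prototypes(roi_features, labels):
    """Assumed aggregation: mean of the L2-normalized support RoI features per class."""
    grouped = defaultdict(list)
    for feat, label in zip(roi_features, labels):
        grouped[label].append(l2_normalize(feat))
    prototypes = {}
    for label, feats in grouped.items():
        dim = len(feats[0])
        mean = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def cosine_scores(query_feature, prototypes):
    """Cosine similarity between one query feature and every class prototype."""
    q = l2_normalize(query_feature)
    return {label: sum(a * b for a, b in zip(q, p)) for label, p in prototypes.items()}


# Toy 2-D "RoI features" with hypothetical class names (illustration only).
support_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
support_labels = ["beetle", "beetle", "spider"]
prototypes = build_class_prototypes(support_feats, support_labels)
scores = cosine_scores([0.9, 0.1], prototypes)
best = max(scores, key=scores.get)
```

In a real detector the feature vectors would come from an RoI pooling layer rather than hand-written lists, and the prototypes would feed the visual-guided branch instead of a nearest-prototype lookup.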

Paper Structure

This paper contains 15 sections, 12 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Text vs. visual prompting for cross-domain few-shot detection. (a) Text-only prompts encode high-level semantics but miss target-domain appearance, leading to weak localization under domain shift. (b) Adding raw visual prompts (support images) enriches semantics but still lacks structured, class-specific guidance. (c) Our approach constructs compact visual prototypes from support images and injects them alongside text features into the detector, which provides domain-adaptive capability for robust FSOD.
  • Figure 2: Overview of the proposed LMP framework for CD-FSOD. From a few labeled support images, we build class-level visual prototypes and, for each ground-truth in a query image, sample K hard-negative boxes via random jittering. A visual-guided branch injects these prototypes into the detection pipeline, while a text-guided branch preserves open-vocabulary semantics. The two branches are trained jointly and ensembled at inference, coupling domain-invariant text features with target-domain appearance for robust few-shot detection.
  • Figure 3: (a) Ablation study results on six target-domain datasets in the 5-shot setting. (b) Impact of (left) the number of hard-negative prototypes and (right) the loss weighting factor $\alpha$ on detection performance across 1/5/10-shot settings on the ArTaxOr dataset.
  • Figure 4: t-SNE visualization of prototype embeddings in ArTaxOr. Triangles are query features of detections and circles are hard-negative features mined around ground-truth boxes. Colors denote different classes. Hard negatives cluster along decision boundaries, while query features form different class groups, which shows how the visual branch separates confusing objects.
  • Figure 5: Qualitative comparison on four target domains. Each triplet shows detections from a text-only prototype baseline (left), our dual-branch method (middle), and ground truth (right). Our visual+text prototypes yield tighter boxes and fewer confusions: clipart scenes reduce spurious boxes on background objects; steel-surface images better separate fine-grained defects; underwater scenes recover more small fish; insect images avoid duplicate boxes and improve localization.
  • ...and 1 more figures
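
The Figure 2 caption mentions sampling K hard-negative boxes around each ground-truth box via random jittering. A minimal sketch of that idea follows, assuming boxes in (x1, y1, x2, y2) format, uniform shift and scale perturbations, and an IoU band that keeps only boxes partially overlapping the ground truth. The perturbation ranges, the IoU band, and all function names are assumptions for illustration; the excerpt does not give the values or the filtering rule used in the paper.

```python
import random


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jitter_box(box, rng, max_shift=0.5, max_scale=0.3):
    """Randomly shift the center and rescale the size of a box (assumed ranges)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2 + rng.uniform(-max_shift, max_shift) * w
    cy = (y1 + y2) / 2 + rng.uniform(-max_shift, max_shift) * h
    nw = w * (1 + rng.uniform(-max_scale, max_scale))
    nh = h * (1 + rng.uniform(-max_scale, max_scale))
    return (cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2)


def sample_hard_negatives(gt_box, k, image_size, iou_range=(0.1, 0.5),
                          seed=0, max_tries=1000):
    """Collect up to k jittered boxes whose IoU with gt_box lies in iou_range."""
    rng = random.Random(seed)
    width, height = image_size
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == k:
            break
        x1, y1, x2, y2 = jitter_box(gt_box, rng)
        # Clip to the image and skip boxes that collapse after clipping.
        clipped = (max(0.0, x1), max(0.0, y1),
                   min(float(width), x2), min(float(height), y2))
        if clipped[2] <= clipped[0] or clipped[3] <= clipped[1]:
            continue
        if iou_range[0] <= iou(clipped, gt_box) < iou_range[1]:
            negatives.append(clipped)
    return negatives


gt = (40.0, 40.0, 120.0, 100.0)
negatives = sample_hard_negatives(gt, k=5, image_size=(200, 160))
```

The IoU band is the design choice that makes these boxes "hard": they overlap the object enough to look similar, but not enough to count as a correct localization. Features pooled from such boxes could then serve as hard-negative prototypes.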