Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment
Zeyu Shangguan, Daniel Seita, Mohammad Rostami
TL;DR
This work addresses cross-domain few-shot object detection by using rich textual information to bridge the gap between visual and language features. It introduces a meta-learning-based architecture with a Multi-modal Feature Aggregation Module and a Rich Text Semantic Rectification Module, built on a DETR-inspired backbone and enhanced with CLIP features and bidirectional text generation. Across the CD-FSOD, CDS-FSOD, PASCAL-VOC, and Mo-FSOD benchmarks, the method substantially outperforms existing few-shot object detection (FSOD) and multi-modal object detection (MM-OD) baselines, particularly under cross-domain and few-shot regimes, and ablations confirm the value of both modules and of attentive fusion. The findings indicate that carefully engineered rich text descriptions and cross-modal alignment can significantly improve domain robustness. This comes at an increased computational cost, which the authors show to be reasonable given the performance gains.
Abstract
Advances in cross-modal feature extraction and integration have significantly improved performance in few-shot learning tasks. However, current multi-modal object detection (MM-OD) methods often suffer notable performance degradation under substantial domain shift. We hypothesize that incorporating rich textual information enables the model to establish a more robust association between visual instances and their corresponding language descriptions, thereby mitigating the effects of domain shift. Specifically, we focus on the problem of Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD) and introduce a meta-learning-based framework that leverages rich textual semantics as an auxiliary modality to achieve effective domain adaptation. Our architecture incorporates two key components: (i) a multi-modal feature aggregation module, which aligns visual and linguistic feature embeddings to ensure cohesive integration across modalities; and (ii) a rich text semantic rectification module, which employs bidirectional text feature generation to refine multi-modal feature alignment, thereby improving how the model interprets language and applies it to object detection. We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
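To make the idea of aligning visual and linguistic embeddings more concrete, the following is a minimal sketch of attention-based fusion, in which each visual feature attends over a set of text features and is combined with the attended summary. It is an illustration under stated assumptions, not the authors' aggregation module: the function name `cross_attention_fuse`, the residual sum, and the omission of learned projection matrices are all hypothetical simplifications, and both modalities are assumed to be already embedded in a shared space of the same dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(visual, text):
    """Fuse visual features with attention-weighted text features.

    visual: list of N vectors of dimension d (e.g. region features).
    text:   list of M vectors of dimension d (e.g. sentence embeddings).
    Returns (fused, weights), where fused holds N vectors of dimension d
    and weights holds N attention distributions over the M text features.

    Assumption: no learned query/key/value projections are applied;
    a real module would normally include them.
    """
    d = len(visual[0])
    scale = 1.0 / math.sqrt(d)  # scaled dot-product attention
    fused, weights = [], []
    for v in visual:
        w = softmax([dot(v, t) * scale for t in text])
        # Attention-weighted average of the text features.
        attended = [sum(wj * t[i] for wj, t in zip(w, text)) for i in range(d)]
        # Residual combination keeps the original visual signal.
        fused.append([vi + ai for vi, ai in zip(v, attended)])
        weights.append(w)
    return fused, weights


if __name__ == "__main__":
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]
    text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused_feats, attn = cross_attention_fuse(visual_feats, text_feats)
    for row in attn:
        print([round(x, 3) for x in row])
```

In this toy setup, a visual feature places more attention on text features that point in a similar direction, which is the basic mechanism by which descriptive text can steer the fused representation.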
