Cross-domain Multi-modal Few-shot Object Detection via Rich Text
Zeyu Shangguan, Daniel Seita, Mohammad Rostami
TL;DR
This work tackles cross-domain multi-modal few-shot object detection by using rich textual descriptions as an auxiliary modality to bridge domain gaps. Building on Meta-DETR, it introduces a multi-modal feature aggregation module that aligns vision and language support features, and a rich text semantic rectify module that reinforces cross-modal alignment during training. Across CD-FSOD benchmarks and standard FSOD tests, the approach outperforms existing FSOD methods, with pronounced improvements when using LLM-generated rich text (e.g., 15.1/48.7/61.4 mAP on ArTaxOr at 1/5/10 shots). The findings indicate that detailed, domain-relevant text can enhance cross-domain few-shot detection, with practical benefits for industrial defect detection and related applications.
Abstract
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks because they generate richer features. However, existing multi-modal object detection (MM-OD) methods degrade under significant domain shift and when training samples are scarce. We hypothesize that rich text information can help the model more effectively relate each visual instance to its language description, and can help mitigate domain shift. Specifically, we study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning-based multi-modal few-shot object detection method that uses rich text semantic information as an auxiliary modality to achieve domain adaptation in the context of FSOD. Our proposed network contains (i) a multi-modal feature aggregation module that aligns the vision and language support feature embeddings and (ii) a rich text semantic rectify module that uses bidirectional text feature generation to reinforce multi-modal feature alignment and thus enhance the model's language understanding capability. We evaluate our model on standard cross-domain object detection datasets and demonstrate that our approach considerably outperforms existing FSOD methods.
