Cross-domain Multi-modal Few-shot Object Detection via Rich Text

Zeyu Shangguan, Daniel Seita, Mohammad Rostami

TL;DR

This work tackles cross-domain, few-shot multi-modal object detection by using rich textual descriptions as an auxiliary modality to bridge domain gaps. Building on a Meta-DETR backbone, it introduces a multi-modal feature aggregation module that aligns the vision and language support feature embeddings, and a rich text semantic rectify module that reinforces cross-modal understanding during training. On CD-FSOD benchmarks and standard FSOD tests, the approach outperforms existing methods, with the largest improvements when using LLM-generated rich text (e.g., ArTaxOr 1/5/10-shot mAP of 15.1/48.7/61.4). The findings indicate that detailed, domain-relevant text can improve cross-domain few-shot detection, with practical benefits for industrial defect detection and related applications.

Abstract

Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks by generating richer features. However, existing multi-modal object detection (MM-OD) methods degrade under significant domain shift and when training samples are scarce. We hypothesize that rich text information can help the model relate a visual instance to its language description more effectively, and can thereby help mitigate domain shift. Specifically, we study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning-based multi-modal few-shot object detection method that uses rich text semantic information as an auxiliary modality to achieve domain adaptation in the context of FSOD. Our proposed network contains (i) a multi-modal feature aggregation module that aligns the vision and language support feature embeddings and (ii) a rich text semantic rectify module that uses bidirectional text feature generation to reinforce multi-modal feature alignment and thus enhance the model's language understanding capability. We evaluate our model on common cross-domain object detection datasets and demonstrate that our approach considerably outperforms existing FSOD methods.

Paper Structure (20 sections, 4 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Different FSOD tasks. The classic FSOD task (top) involves only visual information. MM-FSOD (middle) introduces a language modality to provide extra information to improve FSOD performance. In contrast to these, our proposed CDMM-FSOD task (bottom) is tailored for cross-domain scenarios and extends MM-FSOD to use richer text. Above, we show a cross-domain example where a model might train on images and text of common data (such as birds) and needs to generalize to detection of less common, substantially different data (such as patch defects).
  • Figure 2: Performance results for 10-shot object detection on multiple cross-domain, few-shot datasets. We observe substantial cross-domain degradation for the existing MM-OD methods Next-chat [zhang2023nextchat], GOAT [Wang23GOAT], and ViLD [gu2022openvocabulary], as well as for a single-modal detection method, Meta-DETR [Zhang23MetaDETR]. In contrast, our proposed method has stronger performance on out-of-domain data.
  • Figure 3: The overall structure of our model. The red blocks mark our proposed multi-modal feature aggregation module and rich text rectify module; see Figure 4 for more details about their structure. The multi-modal feature aggregation module mixes the feature embeddings across modalities. The rich text rectify module reinforces the model's cross-modality understanding. The model is end-to-end: for training, it takes a set of support and query images together with a group of rich category texts as input, and it outputs the object detection results for the query images.
  • Figure 4: Details of our meta-learning multi-modal aggregation module (upper region) and the rich semantic rectify module (lower region). We use different colors for different feature branches. The rich semantic rectify module is only used during training, and not at test time.
  • Figure 5: Representative visualizations of detection results on three benchmark datasets. We compare our method with a multi-modal object detection model (Next-chat) and a few-shot object detection model (Meta-DETR). Our proposed model obtains more accurate bounding boxes and improved detection confidence.
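
The captions for Figures 3 and 4 describe two pieces: an aggregation step that mixes vision and language support embeddings, and a rectify branch that is active only during training. The following is a minimal sketch of that control flow, not the authors' implementation. It assumes the aggregation is a scaled dot-product cross-attention with a residual connection, and it stands in for the rectify branch with a simple cosine-distance auxiliary loss; the function names (`aggregate`, `forward`), the pooling, and the loss are all hypothetical choices made for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def aggregate(vision_tokens, text_tokens):
    """ASSUMED fusion rule: each vision support token attends over the
    text tokens, and the attended text is added back as a residual."""
    dim = len(vision_tokens[0])
    fused = []
    for v in vision_tokens:
        weights = softmax([dot(v, t) / math.sqrt(dim) for t in text_tokens])
        attended = [
            sum(w * t[i] for w, t in zip(weights, text_tokens))
            for i in range(dim)
        ]
        fused.append([vi + ai for vi, ai in zip(v, attended)])
    return fused


def forward(vision_tokens, text_tokens, training):
    """Fuse the two modalities; compute the auxiliary loss only in training."""
    fused = aggregate(vision_tokens, text_tokens)
    aux_loss = None
    if training:
        # HYPOTHETICAL stand-in for the rectify branch: pull the pooled
        # fused feature toward the pooled text feature.
        aux_loss = cosine_distance(mean_pool(fused), mean_pool(text_tokens))
    return fused, aux_loss


# Toy inputs: two vision support tokens and one rich-text token, dimension 2.
vision = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0]]
fused_train, loss_train = forward(vision, text, training=True)
fused_test, loss_test = forward(vision, text, training=False)
```

The point of the sketch is the `training` flag: the fused features are identical in both modes, and only the auxiliary term disappears at test time, which matches the statement in the Figure 4 caption that the rectify module is not used at test time.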