Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment

Zeyu Shangguan, Daniel Seita, Mohammad Rostami

TL;DR

This work addresses cross-domain few-shot object detection by leveraging rich textual information to bridge visual-language gaps. It introduces a meta-learning-based architecture with a Multi-modal Feature Aggregation Module and a Rich Text Semantic Rectification Module, built on a DETR-inspired backbone and enhanced with CLIP features and bidirectional text generation. Across CD-FSOD, CDS-FSOD, PASCAL-VOC, and Mo-FSOD benchmarks, the method substantially outperforms existing FSOD/MM-OD baselines, particularly under cross-domain and few-shot regimes, with ablations confirming the value of both modules and attentive fusion. The findings demonstrate that carefully engineered rich text descriptions and cross-modal alignment can significantly improve domain robustness, albeit with increased computational cost, which the authors show to be reasonable given the performance gains.

Abstract

Advancements in cross-modal feature extraction and integration have significantly enhanced performance in few-shot learning tasks. However, current multi-modal object detection (MM-OD) methods often experience notable performance degradation when encountering substantial domain shifts. We propose that incorporating rich textual information can enable the model to establish a more robust knowledge relationship between visual instances and their corresponding language descriptions, thereby mitigating the challenges of domain shift. Specifically, we focus on the problem of Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD) and introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality to achieve effective domain adaptation. Our new architecture incorporates two key components: (i) A multi-modal feature aggregation module, which aligns visual and linguistic feature embeddings to ensure cohesive integration across modalities. (ii) A rich text semantic rectification module, which employs bidirectional text feature generation to refine multi-modal feature alignment, thereby enhancing understanding of language and its application in object detection. We evaluate the proposed method on common cross-domain object detection benchmarks and demonstrate that it significantly surpasses existing few-shot object detection approaches.
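As a rough illustration of what aligning visual and linguistic feature embeddings can look like, the sketch below fuses a set of visual tokens with a set of text embeddings through scaled dot-product cross-attention. This is not the authors' module: the function name `attentive_fusion`, the residual connection, the shared embedding width, and the random inputs are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attentive_fusion(visual, text):
    """Hypothetical cross-modal fusion (illustrative, not the paper's design).

    visual: (n_v, d) array of visual token features (used as queries).
    text:   (n_t, d) array of text embeddings (used as keys and values).
    Returns the fused features (n_v, d) and the attention weights (n_v, n_t).
    """
    d = visual.shape[-1]
    # Similarity between every visual token and every text embedding.
    scores = visual @ text.T / np.sqrt(d)
    # Each visual token gets a distribution over the text embeddings.
    weights = softmax(scores, axis=-1)
    # Residual connection: keep the visual feature, add attended text context.
    fused = visual + weights @ text
    return fused, weights


# Toy inputs: 5 visual tokens and 3 text embeddings, both of width 8 (assumed).
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))
text = rng.normal(size=(3, 8))
fused, weights = attentive_fusion(visual, text)
print(fused.shape, weights.shape)
```

In this toy setup each row of `weights` sums to one, so every visual token receives a convex combination of the text embeddings. A real detector would typically use learned projections and multiple attention heads in place of the raw dot product.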

Paper Structure

This paper contains 41 sections, 4 equations, 15 figures, 18 tables.

Figures (15)

  • Figure 1: Different FSOD tasks: The classic FSOD task (top) relies solely on visual information for object detection. MM-FSOD (middle) enhances FSOD performance by incorporating a language modality, providing additional contextual information. Building on this approach, our proposed CDMM-FSOD task (bottom) is specifically designed for cross-domain scenarios, extending MM-FSOD by utilizing richer, more detailed text descriptions. The example demonstrates a cross-domain challenge: the model is trained on images and text of common objects (e.g., birds) but must generalize to detect significantly different and less common objects (e.g., patch defects).
  • Figure 2: Performance results on 10-shot object detection on multiple cross-domain, few-shot datasets: We observe substantial cross-domain degradation for the existing MM-OD methods Next-chat [zhang2023nextchat], GOAT [Wang23GOAT], and ViLD [gu2022openvocabulary], as well as for a single-modal detection method, Meta-DETR [Zhang23MetaDETR]. In contrast, our proposed method has stronger performance on out-of-domain data.
  • Figure 3: The proposed architecture: the proposed "multi-modal feature aggregation module" and "rich text rectification module" are highlighted in red blocks; Figure 4 provides further details about their design and structure. The "multi-modal feature aggregation module" fuses features across modalities, enabling effective cross-modal embedding integration. The "rich text rectification module" enhances the model's ability to comprehend and leverage information from both image and text modalities. The model operates end to end: during training it processes a set of support images, query images, and a collection of rich category-specific textual descriptions, and it outputs the detection results for the objects present in the query images, combining visual and textual inputs for improved performance.
  • Figure 4: Details of our meta-learning multi-modal aggregation module (upper region) and the rich semantic rectify module (lower region). Different colors are used for different feature branches. The rich semantic rectify module is used only during training, and not at test time.
  • Figure 5: Examples of images and categories from ArTaxOr.
  • ...and 10 more figures