UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval

Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua

TL;DR

UCDR-Adapter tackles universal cross-domain retrieval by integrating adapter modules into pre-trained vision–language models and introducing dynamic, two-phase prompts. Phase 1 learns class- and domain-specific prompts via a Learnable Textual Semantic Template and momentum-updated prompts with dual losses for strong multimodal alignment. Phase 2 generates target prompts by attending over masked source prompts, enabling adaptation to unseen domains and classes, with test-time inference relying only on the image branch and generated prompts. Across DomainNet, Sketchy, and TU-Berlin, UCDR-Adapter delivers superior retrieval performance under UCDR, U$^d$CDR, and U$^c$CDR, demonstrating improved generalization with practical efficiency for real-world deployment.

Abstract

Universal Cross-Domain Retrieval (UCDR) aims to retrieve relevant images from unseen domains and classes without semantic labels, which requires robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, which reduces adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter outperforms ProS in most cases, as well as other state-of-the-art methods, across the UCDR, U$^c$CDR, and U$^d$CDR settings.
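The two mechanisms named in the abstract, momentum updates and attention over masked source prompts, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`ema_update`, `generate_target_prompt`), the momentum coefficient `m`, the use of a single image-derived query vector, and the scaled dot-product form of the attention are all assumptions made for illustration.

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.99):
    """Momentum (exponential moving average) update.

    The slowly changing copy moves a small step toward the online
    parameters. The coefficient m is a hypothetical default.
    """
    return m * momentum_params + (1.0 - m) * online_params

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def generate_target_prompt(query, source_prompts, mask):
    """Build a target prompt as an attention-weighted mix of source prompts.

    query:          (d,)   image-derived query vector (assumed input)
    source_prompts: (n, d) learned source prompts
    mask:           (n,)   True where a source prompt is hidden;
                           at least one entry must be False
    """
    d = query.shape[0]
    scores = source_prompts @ query / np.sqrt(d)   # (n,) similarity scores
    scores = np.where(mask, -np.inf, scores)       # masked prompts get zero weight
    weights = softmax(scores)
    return weights @ source_prompts, weights
```

Masking a source prompt during training forces the generated prompt to be composed from the remaining ones, which is one way to simulate a query whose own domain or class prompt is unavailable.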

Paper Structure

This paper contains 12 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of UCDR settings. Training involves seen categories (e.g., Lion) and domains (e.g., Real). Testing includes unseen domains (e.g., Sketch) and categories (e.g., Cookie) using U$^d$CDR and U$^c$CDR principles. Unlike Zero-Shot Domain Generalization (ZSDG) (mondal2022seic, mangla2022cocoa, arfeen2022handling), UCDR does not rely on true labels for unseen data, aligning better with real-world scenarios.
  • Figure 2: UCDR-Adapter architecture. In Phase 1 (top), Source Adapter Learning optimizes class and domain prompts via a momentum encoder and dual loss functions for aligned multimodal representations. Only relevant prompts are activated based on the input image. In Phase 2 (bottom), the Target Prompt Generation module generates adapted prompts by attending over masked source prompts, simulating adaptation to unseen domains and classes. At test time (right), only the image branch is utilized with the generated target prompts for effective retrieval without textual cues.
  • Figure 3: Phase 1 loss function. Right: ITC loss. Left: Triplet Loss. Solid circles are image features, boxes represent the semantic features of each class, and the dashed circles correspond to the image domain.
  • Figure 4: Target Prompt Generation process. Gray indicates that the class and domain prompts are masked. Here, $P_d$ and $P_c$ are the target prompts generated for unseen domains and classes, respectively.
  • Figure 5: Retrieval results of UCDR-Adapter under the UCDR setting on DomainNet, using the 'Cloud' class from the holdout domain as the query.
  • ...and 1 more figure
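The two Phase 1 losses named in the Figure 3 caption can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's formulation: it assumes ITC denotes the usual image-text contrastive objective (a cross-entropy over cosine similarities between image features and per-class text features), and the `temperature` and `margin` values, the Euclidean distance, and the function names are hypothetical choices.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def itc_loss(image_feats, text_feats, labels, temperature=0.07):
    """Image-text contrastive loss (assumed form).

    image_feats: (b, d) image features
    text_feats:  (c, d) one semantic feature per class
    labels:      (b,)   integer class id of each image
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature                     # (b, c)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances.

    Penalizes anchors that are not closer to their positive than to
    their negative by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

In this sketch the contrastive term pulls each image feature toward the text feature of its class, while the triplet term separates features within the image space; how the paper weights and combines the two is not stated in this excerpt.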