ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval

Kaipeng Fang; Jingkuan Song; Lianli Gao; Pengpeng Zeng; Zhi-Qi Cheng; Xiyao Li; Heng Tao Shen

TL;DR

This work tackles Universal Cross-Domain Retrieval (UCDR) by introducing ProS, a prompt-tuning framework that simulates generalized knowledge through Content-aware Dynamic Prompts (CaDP). ProS comprises two stages: Prompt Units Learning, which builds domain and semantic prompt units using a mask-and-align strategy, and Context-aware Simulator Learning, which trains a Content-aware Prompt Simulator (CaPS) to generate CaDP under simulated test conditions. At retrieval time, CaDP steers the CLIP image encoder to produce embeddings that generalize to unseen domains and categories. Experiments on DomainNet, Sketchy, and TU-Berlin show state-of-the-art performance with a modest number of trainable parameters, and ablation studies confirm the importance of each component. The code is publicly available.

Abstract

The goal of Universal Cross-Domain Retrieval (UCDR) is to achieve robust performance in generalized test scenarios, in which test data may belong to domains and categories that are strictly unseen during training. Recently, pre-trained models with prompt tuning have shown strong generalization capabilities and attained noteworthy results in various downstream tasks, such as few-shot learning and video-text retrieval. However, applying them directly to UCDR may not be sufficient to handle both domain shift (i.e., adapting to unfamiliar domains) and semantic shift (i.e., transferring to unknown categories). To this end, we propose Prompting-to-Simulate (ProS), the first method to apply prompt tuning to UCDR. ProS employs a two-step process to simulate Content-aware Dynamic Prompts (CaDP), which guide models to produce generalized features for UCDR. Concretely, in the Prompt Units Learning stage, we introduce two Prompt Units that individually capture domain and semantic knowledge in a mask-and-align way. Then, in the Context-aware Simulator Learning stage, we train a Content-aware Prompt Simulator under simulated test scenarios to produce the corresponding CaDP. Extensive experiments conducted on three benchmark datasets show that our method achieves new state-of-the-art performance without introducing excessive parameters. Our method is publicly available at https://github.com/fangkaipeng/ProS.

Paper Structure (15 sections, 8 equations, 5 figures, 4 tables)

Introduction
Related Work
  Universal Cross-Domain Retrieval
  Vision-Language Pre-training Models
  Prompt Tuning
Method
  Preliminary
  Prompt-to-Simulate
  Retrieval by ProS
Experiment
  Experimental setting
  Main Results
  Ablation Study
  Qualitative Analysis
Conclusion

Figures (5)

Figure 1: (a) Illustration of Cross-Domain Retrieval (CDR) and its generalized version (UCDR). (b) Comparison of our ProS ☆ with different backbones $\bigcirc$ and various prompt-based methods $\triangle$ under the UCDR protocol. All prompt-based methods use CLIP as the backbone. Our method yields a solid improvement and achieves a better trade-off between performance and trainable parameter usage than state-of-the-art methods.
Figure 2: Overview of our proposed ProS. In the Prompt Units Learning Stage, we capture knowledge from source data into domain prompt units $DP$ and semantic prompt units $SP$ by masking irrelevant prompts. In the Context-aware Prompt Simulation Stage, we train a Context-aware Prompt Simulator (CaPS) with a mask operation to dynamically convert prompt templates $PT$ into two Content-aware Dynamic Prompts (CaDP) that simulate unknown domains and categories. In the retrieval phase, we employ CaPS to produce CaDP, which guides the CLIP image encoder to convert unseen samples into suitable embeddings for retrieval. The gray parts indicate masked prompts.
Figure 3: Evaluation results for two prompt lengths. (a) Impact of the text prompt length. (b) Impact of the length of the Content-aware Dynamic Prompts generated by CaPS $\mathcal{M}$, where 0-0 represents VPT and 1-1 means one CaDP for domain and one for semantics.
Figure 4: Visualization of image features from 10 randomly selected unseen classes of the Real query set and the unseen Infograph gallery set. Different colors represent different categories, while $\bigcirc$ and $\bigtriangleup$ represent samples from the Real and Infograph domains, respectively. We further evaluate performance using the metric from LBHash, i.e., $\sigma = \frac{\max \mathcal{D}_{\mathit{intra}}}{\min \mathcal{D}_{\mathit{inter}}}$ (lower is better).
Figure 5: Retrieval results under the UCDR protocol on DomainNet. (a) displays the retrieval results for "Peas" given a query from an unseen query domain. (b) shows the retrieval results for a few queries from Quickdraw. True positives and false positives are shown with green and red borders, respectively.
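
The ratio $\sigma$ quoted in the Figure 4 caption can be illustrated with a minimal sketch. The caption does not define $\mathcal{D}_{\mathit{intra}}$ and $\mathcal{D}_{\mathit{inter}}$, so this sketch assumes they are the sets of pairwise Euclidean distances between same-class and different-class features, respectively. The function name `intra_inter_ratio` and the toy data are hypothetical and are not taken from the paper or its repository.

```python
import math
from itertools import combinations


def intra_inter_ratio(features, labels):
    """Return sigma = max(intra-class distance) / min(inter-class distance).

    Assumption: distances are pairwise Euclidean distances between feature
    vectors; the paper's exact definition may differ. Lower is better.
    """
    intra, inter = [], []
    # Visit every unordered pair of samples exactly once.
    for (fa, la), (fb, lb) in combinations(zip(features, labels), 2):
        dist = math.dist(fa, fb)
        if la == lb:
            intra.append(dist)
        else:
            inter.append(dist)
    if not intra or not inter:
        raise ValueError("need a same-class pair and a different-class pair")
    return max(intra) / min(inter)


# Toy 2-D features: two tight clusters that are far apart.
feats = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labs = [0, 0, 1, 1]
sigma = intra_inter_ratio(feats, labs)
print(sigma)
```

On this toy input the largest same-class distance is 1 and the smallest cross-class distance is 4, so compact, well-separated clusters give a small ratio.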