
Text-Phase Synergy Network with Dual Priors for Unsupervised Cross-Domain Image Retrieval

Jing Yang, Hui Xue, Shipeng Zhu, Pengfei Fang

Abstract

This paper studies unsupervised cross-domain image retrieval (UCDIR), which aims to retrieve images of the same category across different domains without relying on labeled data. Existing methods typically use pseudo-labels, derived from clustering algorithms, as supervisory signals for intra-domain representation learning and cross-domain feature alignment. However, these discrete pseudo-labels often fail to provide accurate and comprehensive semantic guidance. Moreover, the alignment process frequently overlooks the entanglement between domain-specific and semantic information, leading to semantic degradation in the learned representations and ultimately impairing retrieval performance. This paper addresses these limitations by proposing a Text-Phase Synergy Network with Dual Priors (TPSNet). Specifically, we first employ CLIP to generate a set of class-specific prompts per domain, termed domain prompts, which serve as a text prior that offers more precise semantic supervision. In parallel, we introduce a phase prior, represented by domain-invariant phase features, which is integrated into the original image representations to bridge the domain distribution gaps while preserving semantic integrity. Leveraging the synergy of these dual priors, TPSNet significantly outperforms state-of-the-art methods on UCDIR benchmarks.
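
To make the idea of a phase prior concrete, the following is a minimal sketch of extracting Fourier phase from an image with NumPy. It is not the paper's phase feature encoder: the function names `phase_spectrum` and `phase_only_image`, the grayscale input, and the unit-amplitude reconstruction are all assumptions made for illustration. The sketch only demonstrates one narrow property, namely that a global intensity rescaling changes the Fourier amplitude but leaves the phase untouched.

```python
import numpy as np


def phase_spectrum(image):
    """Return the Fourier phase (in radians) of a 2-D grayscale image."""
    return np.angle(np.fft.fft2(image))


def phase_only_image(image):
    """Rebuild an image from its phase alone, with every amplitude set to 1.

    Hypothetical helper for illustration; the paper's encoder may differ.
    """
    phase = phase_spectrum(image)
    return np.real(np.fft.ifft2(np.exp(1j * phase)))


# Toy data: a random "image" and a rescaled copy that stands in for a
# crude global appearance change (an assumption, not a real domain shift).
rng = np.random.default_rng(0)
image = rng.random((32, 32))
restyled = 3.0 * image

phase_a = phase_only_image(image)
phase_b = phase_only_image(restyled)

# The two phase-only reconstructions agree up to floating-point error.
same_phase = np.allclose(phase_a, phase_b, atol=1e-6)
```

Real domain shifts (for example, photo versus sketch) are far richer than a global rescaling, so this should be read as an intuition aid rather than evidence for the method.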

Paper Structure

This paper contains 40 sections, 17 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparison between (a) existing methods and (b) our proposed TPSNet. Existing methods rely on inaccurate pseudo-labels for intra-domain and cross-domain learning, often causing semantic loss. In contrast, TPSNet leverages text and phase dual priors to extract domain-invariant semantic features.
  • Figure 2: The pipeline of TPSNet. Left: domain prompt generation via the prompt learning paradigm. Top-right: text-phase dual prior construction with contrastive learning for unsupervised cross-domain image retrieval. Bottom: the detailed architecture of the proposed phase feature encoder.
  • Figure 3: Average Accuracy (%) of UCDIR Methods using (a) ResNet-50 and (b) ViT-B as image encoders.
  • Figure 4: t-SNE visualizations of last-layer features for the baseline model (v1), baseline with text prior (v2), and TPSNet (v3) across two scenarios from two datasets.
  • Figure 5: Grad-CAM visualizations of last-layer features for the baseline model (v1), baseline with text prior (v2), and TPSNet (v3) on randomly selected samples.
  • ...and 5 more figures