Test-time Adaptation for Cross-modal Retrieval with Query Shift

Haobin Li, Peng Hu, Qianjun Zhang, Xi Peng, Xiting Liu, Mouxing Yang

TL;DR

This paper proposes a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR), which employs a novel module to refine the query predictions and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation for the cross-modal retrieval models with query shift.

Abstract

The success of most existing cross-modal retrieval methods relies heavily on the assumption that the given queries follow the same distribution as the source domain. However, this assumption is easily violated in real-world scenarios due to the complexity and diversity of queries, which leads to the query shift problem. Specifically, query shift refers to an online query stream that originates from a domain whose distribution differs from that of the source domain. In this paper, we observe that query shift not only diminishes the uniformity (namely, within-modality scatter) of the query modality but also amplifies the gap between the query and gallery modalities. Based on these observations, we propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR). In brief, TCR employs a novel module to refine the query predictions (namely, the retrieval results of the query) and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation for cross-modal retrieval models under query shift. Extensive experiments demonstrate the effectiveness of the proposed TCR against query shift. The code will be released upon acceptance.
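
To make the two observed quantities concrete, the following minimal sketch measures a within-modality scatter and a modality gap on toy embeddings. The exact definitions used by TCR are not given in this section, so both measures here are assumptions chosen for illustration: scatter is taken as the mean Euclidean distance to the modality centroid, and the gap as the distance between the query and gallery centroids. The function names `centroid`, `scatter`, and `modality_gap` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def scatter(vectors):
    """Mean Euclidean distance to the centroid.

    Assumed proxy for within-modality scatter (uniformity); not
    necessarily the definition used in the paper.
    """
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)


def modality_gap(queries, gallery):
    """Euclidean distance between the two modality centroids.

    Assumed proxy for the query-gallery modality gap.
    """
    return math.dist(centroid(queries), centroid(gallery))


# Toy 2-D embeddings, purely illustrative.
queries = [[1.0, 0.0], [0.0, 1.0]]
gallery = [[-1.0, 0.0], [0.0, -1.0]]

print(scatter(queries))                # spread of the query modality
print(modality_gap(queries, gallery))  # distance between modalities
```

Under these assumed definitions, the abstract's observation would correspond to `scatter(queries)` shrinking and `modality_gap(queries, gallery)` growing when the queries come from a shifted domain.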

Paper Structure

This paper contains 30 sections, 15 equations, 10 figures, 12 tables, 2 algorithms.

Figures (10)

  • Figure 1: (a) Dominant Paradigm: the pre-trained models embrace powerful zero-shot retrieval capacity and could be fine-tuned on domain-specific data for customization, which has emerged as the dominant paradigm for cross-modal retrieval. (b) Query Shift: the performance of the paradigm would be significantly degraded when encountering the query shift problem. On the one hand, collecting sufficient data to tailor the pre-trained models for scarce domains is daunting and even impossible. On the other hand, as the saying goes, "Different strokes for different folks", even fine-tuned models cannot accommodate all personalized domains. (c) Observations: we study the query shift problem for cross-modal retrieval and reveal the following observations. Namely, query shift not only diminishes the uniformity of the query modality but also amplifies the modality gap between the query and gallery modalities, undermining the well-structured common space inherited from pre-trained models.
  • Figure 2: Overview of the proposed TCR. For the given online queries, the modality-specific encoders are employed to project the query and gallery samples into the latent space established by the source model. The obtained query embeddings and gallery embeddings are passed into the query prediction refinement module. In the module, TCR first selects the most similar gallery sample for each query and obtains the query-gallery pairs. After that, the pairs with higher uniformity and lower modality gap are chosen to estimate the filtering threshold of the query predictions and the modality gap of the source model, which serve as the constraints for the adaptation. Finally, three loss functions are employed to achieve robust adaptation for cross-modal retrieval with query shift.
  • Figure 3: Observation of the intra-modality uniformity and inter-modality gap. The increasing $\lambda^{\operatorname{scale}}$ indicates the growing intra-modality uniformity while the decreasing $\lambda^{\operatorname{offset}}$ indicates the narrowing inter-modality gap. Notably, $\lambda^{\operatorname{scale}}=1.0$ and $\lambda^{\operatorname{offset}}=0$ represent no scaling and no offset, respectively.
  • Figure 4: Finer-grained ablation studies. (a) The parameter analysis of $\tau$ (Eq. \ref{eq: prediction} and Eq. \ref{eq: new prediction}) on the vanilla TTA method Tent w/ (solid line) and w/o (dotted line) the query prediction refinement module. (b) The parameter analysis of $t$ in Eq. \ref{eq: loss uniformity}. (c) The t-SNE visualization of TR on the query and gallery embeddings after employing the proposed TCR.
  • Figure 5: Examples of 16 types of image corruption. The original image is from the COCO dataset.
  • ...and 5 more figures
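
The first step described in the Figure 2 caption, selecting the most similar gallery sample for each query to form query-gallery pairs, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: cosine similarity is assumed as the similarity measure, embeddings are assumed to be non-zero vectors, and the names `cosine` and `nearest_gallery_pairs` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_gallery_pairs(queries, gallery):
    """Pair each query with its most similar gallery embedding.

    Returns a list of (query_index, gallery_index, similarity) tuples.
    Cosine similarity is an assumption; the caption only says
    "most similar".
    """
    pairs = []
    for qi, q in enumerate(queries):
        sims = [cosine(q, g) for g in gallery]
        gi = max(range(len(gallery)), key=sims.__getitem__)
        pairs.append((qi, gi, sims[gi]))
    return pairs


# Toy 2-D embeddings, purely illustrative.
queries = [[1.0, 0.0], [0.0, 2.0]]
gallery = [[0.0, 1.0], [3.0, 0.1]]

for qi, gi, sim in nearest_gallery_pairs(queries, gallery):
    print(f"query {qi} -> gallery {gi} (cosine similarity {sim:.3f})")
```

The later steps in the caption (choosing pairs with higher uniformity and lower modality gap, and estimating the filtering threshold) depend on definitions that this section does not give, so they are left out of the sketch.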