Toward Robust and Harmonious Adaptation for Cross-modal Retrieval

Haobin Li, Mouxing Yang, Xi Peng

TL;DR

This work addresses the query shift (QS) problem in cross-modal retrieval (CMR), where online, diverse queries disrupt the general-to-customized adaptation paradigm. It introduces REST, a three-part approach: query prediction refinement to produce meaningful candidate sets; a QS-robust objective combining query uniformity, query-gallery gap, and query-gallery consistency losses; and gradient decoupling to prevent forgetting general knowledge during online adaptation. Across 20 benchmarks spanning image-text, video-audio, and composed image retrieval, REST substantially outperforms baselines under both online and diverse query shifts, while maintaining performance as domain shifts intensify. Theoretical and empirical analyses explain why REM and gradient decoupling foster stable, harmonious adaptation, enabling effective online CMR in real-world, multi-domain settings. The results underscore REST’s practical impact for robust, scalable cross-modal retrieval in dynamic environments.

Abstract

Recently, the general-to-customized paradigm has emerged as the dominant approach for Cross-Modal Retrieval (CMR), reconciling the distribution shift between the source domain and the target domain. However, existing general-to-customized CMR methods typically assume that the entire target-domain data is available. This assumption is easily violated in real-world scenarios, so these methods inevitably suffer from the query shift (QS) problem. Specifically, query shift exhibits the following two characteristics, which pose new challenges to CMR. i) Online Shift: real-world queries arrive in an online manner, making it impractical for customization approaches to access the entire query set beforehand; ii) Diverse Shift: even with domain customization, CMR models struggle to satisfy queries from diverse users or scenarios, leaving an urgent need to accommodate diverse queries. In this paper, we observe that QS not only undermines the well-structured common space inherited from the source model, but also steers the model toward forgetting the general knowledge indispensable for CMR. Inspired by these observations, we propose a novel method for online and harmonious adaptation against QS, dubbed Robust adaptation with quEry ShifT (REST). To deal with online shift, REST first refines the retrieval results to formulate the query predictions and then applies a QS-robust objective function to these predictions to preserve the well-established common space in an online manner. To tackle the more challenging diverse shift, REST employs a gradient decoupling module to manipulate the gradients during adaptation, thus preventing the CMR model from forgetting the general knowledge. Extensive experiments on 20 benchmarks across three CMR tasks verify the effectiveness of our method against QS.

Paper Structure

This paper contains 42 sections, 4 theorems, 29 equations, 6 figures, 19 tables.

Key Result

Theorem 1

Let $\hat{p}$ denote the refined query prediction of the given query $x$, where $\hat{p}_{i}$ denotes the probability that $x$ is associated with the $i$-th candidate. For the entropy minimization objective $\mathcal{L}_{EM}=-\sum_{i}\hat{p}_{i}\log \hat{p}_{i}$, the gradient with respect to the pr where $|\cdot|$ denotes the absolute value.
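
The theorem statement is cut off in this listing, so its actual result is not reproduced here. As a minimal sketch of the objective it starts from, the snippet below evaluates $\mathcal{L}_{EM}$ on a prediction vector and computes the elementary derivative $\partial\mathcal{L}_{EM}/\partial\hat{p}_{i}=-(\log\hat{p}_{i}+1)$, treating each $\hat{p}_{i}$ as a free variable. Building $\hat{p}$ from a temperature-scaled softmax over similarities, the function names, and the example values are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def softmax(scores, tau=1.0):
    """Temperature-scaled softmax (assumed way of forming a prediction)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """L_EM = -sum_i p_i log p_i, using the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def entropy_grad_wrt_p(p):
    """dL_EM/dp_i = -(log p_i + 1); requires every p_i > 0."""
    return [-(math.log(pi) + 1.0) for pi in p]


# Hypothetical query-candidate similarities for three candidates.
sims = [0.9, 0.7, 0.1]
p_hat = softmax(sims, tau=0.1)
loss = entropy(p_hat)
grad = entropy_grad_wrt_p(p_hat)

print("p_hat:", [round(v, 4) for v in p_hat])
print("L_EM :", round(loss, 4))
print("grad :", [round(g, 4) for g in grad])
```

The derivative grows without bound as $\hat{p}_{i}$ approaches zero, which is one reason the function is documented as requiring strictly positive entries.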

Figures (6)

  • Figure 1: (a) Query Shift. The general-to-customized CMR paradigm suffers from the query shift problem, which exhibits online shift and diverse shift characteristics. On the one hand, in real-world applications, inquirers submit queries incrementally, thus forming an online query stream. Across various CMR tasks, queries exhibit not only uni-distribution shift but also more complex multi-distribution shift, both of which would invalidate the general-to-customized paradigm. On the other hand, the submitted queries may originate from highly personalized domains, e.g., in e-commerce transactions, queries often involve products from various domains such as instruments, sports, and luxury goods. Unfortunately, accommodating diverse queries inevitably runs into the dilemma of forgetting the general knowledge for CMR tasks. (b) Observations. We study the QS problem for image-text, video-audio, and composed image retrieval and reveal the following observations: i) QS not only diminishes the uniformity of queries, but also amplifies the gap and weakens the consistency between the query and gallery sets, all of which undermine the well-structured common space inherited from source models; ii) simply employing a vanilla TTA method (e.g., Tent) on diverse queries results in an obtuse angle between the gradients with respect to domain-specific and pre-training data. In other words, the harmony between domain-specific knowledge and general knowledge becomes fragile during the adaptation process.
  • Figure 2: Overview of the proposed REST. For the given online queries from diverse domains, the query and gallery encoders are first adopted to map the queries and candidates into the common space. The obtained embeddings are passed into the query prediction refinement module, which selects positive and valuable negative pairs for each query to formulate the refined query prediction. After that, the positives with higher uniformity and lower consistency are adopted to estimate the threshold for noise filtering and the query-gallery gap that constrains the adaptation process. Then, three independent losses are employed to achieve robust adaptation against QS. Finally, the gradient decoupling module manipulates the strength and direction of the gradient to achieve harmonious adaptation, thereby avoiding forgetting of the general knowledge.
  • Figure 3: Observation of the query uniformity and query-gallery gap. Increasing $\lambda^{\operatorname{scale}}$ enlarges the query uniformity, while decreasing $\lambda^{\operatorname{offset}}$ narrows the query-gallery gap. Notably, $\lambda^{\operatorname{scale}}=1.0$ and $\lambda^{\operatorname{offset}}=0$ represent no scaling and no offset, respectively.
  • Figure 4: t-SNE visualization of query and candidate embeddings after employing the proposed REST.
  • Figure 5: Finer-grained ablation studies. (a) Parameter analysis of $\tau$ on the vanilla TTA method Tent w/ (solid line) and w/o (dotted line) the query prediction refinement module. (b) Parameter analysis of the selected neighbors $K$ in Eq. \ref{eq: new prediction}. (c) Analysis of query-gallery consistency before and after employing REST.
  • ...and 1 more figure
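
The Figure 1 caption notes that vanilla TTA on diverse queries can produce an obtuse angle between the gradient on domain-specific data and the gradient on pre-training data. Geometrically, an obtuse angle is equivalent to a negative dot product, so the conflict can be checked as in the sketch below. This is only a diagnostic written for illustration; it is not the paper's gradient decoupling module, and the function names and the two example gradient vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two flattened gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def angle_degrees(g_a, g_b):
    """Angle between two gradients in degrees; assumes both are non-zero."""
    norm_a = math.sqrt(dot(g_a, g_a))
    norm_b = math.sqrt(dot(g_b, g_b))
    cos = dot(g_a, g_b) / (norm_a * norm_b)
    cos = max(-1.0, min(1.0, cos))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def is_obtuse(g_a, g_b):
    """True when the gradients conflict, i.e. their dot product is negative."""
    return dot(g_a, g_b) < 0.0


# Hypothetical flattened gradients, chosen only to illustrate a conflict.
g_domain = [1.0, 0.0]    # gradient on domain-specific queries
g_general = [-1.0, 1.0]  # gradient on pre-training data

print("angle (deg):", angle_degrees(g_domain, g_general))
print("conflict   :", is_obtuse(g_domain, g_general))
```

When the check returns true, a step along one gradient increases the first-order loss associated with the other, which is the fragility the caption describes.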

Theorems & Definitions (9)

  • Definition 1: Query Shift
  • Theorem 1
  • Definition 2
  • Definition 3
  • Theorem 2
  • Theorem 1
  • proof
  • Theorem 2
  • proof