Toward Robust and Harmonious Adaptation for Cross-modal Retrieval
Haobin Li, Mouxing Yang, Xi Peng
TL;DR
This work addresses the query shift (QS) problem in cross-modal retrieval (CMR), where queries that arrive online and come from diverse users or scenarios disrupt the general-to-customized adaptation paradigm. It introduces REST, which has three components: query prediction refinement, which produces meaningful candidate sets; a QS-robust objective that combines query uniformity, query-gallery gap, and query-gallery consistency losses; and gradient decoupling, which prevents the model from forgetting general knowledge during online adaptation. Across 20 benchmarks spanning image-text, video-audio, and composed image retrieval, REST substantially outperforms baselines under both online and diverse query shifts and maintains its performance as domain shifts intensify. Theoretical and empirical analyses examine why REM and gradient decoupling lead to stable, harmonious adaptation. The results suggest that REST is practical for robust, scalable cross-modal retrieval in dynamic, multi-domain settings.
Abstract
Recently, the general-to-customized paradigm has emerged as the dominant approach for Cross-Modal Retrieval (CMR), reconciling the distribution shift between the source domain and the target domain. However, existing general-to-customized CMR methods typically assume that the entire target-domain data is available. This assumption is easily violated in real-world scenarios, so these methods inevitably suffer from the query shift (QS) problem. Specifically, query shift has two characteristics that pose new challenges to CMR. i) Online Shift: real-world queries arrive in an online manner, making it impractical for customization approaches to access the entire query set beforehand; ii) Diverse Shift: even with domain customization, CMR models struggle to satisfy queries from diverse users or scenarios, leaving an urgent need to accommodate diverse queries. In this paper, we observe that QS not only undermines the well-structured common space inherited from the source model, but also steers the model toward forgetting the general knowledge indispensable for CMR. Inspired by these observations, we propose a novel method for online and harmonious adaptation against QS, dubbed Robust adaptation with quEry ShifT (REST). To deal with online shift, REST first refines the retrieval results to form the query predictions and then applies a QS-robust objective function to these predictions to preserve the well-established common space in an online manner. To tackle the more challenging diverse shift, REST employs a gradient decoupling module that manipulates the gradients during adaptation, thus preventing the CMR model from forgetting the general knowledge. Extensive experiments on 20 benchmarks across three CMR tasks verify the effectiveness of our method against QS.
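To make the idea of manipulating gradients during adaptation more concrete, the following minimal sketch shows one generic way to keep an adaptation update from working against general knowledge. It is not the REST gradient decoupling module, whose formulation is not given in the abstract. It is a stand-in based on PCGrad-style gradient projection: when the adaptation gradient conflicts with a reference gradient that represents general knowledge, the conflicting component is removed. The function name `project_conflicting_gradient` and the arguments `g_adapt` and `g_general` are hypothetical.

```python
import numpy as np


def project_conflicting_gradient(g_adapt, g_general, eps=1e-12):
    """PCGrad-style projection (an assumed stand-in, not the REST module).

    g_adapt:   flattened gradient of the online adaptation objective (hypothetical).
    g_general: flattened gradient representing general knowledge (hypothetical).
    """
    g_adapt = np.asarray(g_adapt, dtype=float)
    g_general = np.asarray(g_general, dtype=float)

    dot = float(g_adapt @ g_general)
    if dot >= 0.0:
        # No conflict: the adaptation step does not oppose the general gradient.
        return g_adapt.copy()

    # Conflict: subtract the component of g_adapt that points against g_general.
    scale = dot / (float(g_general @ g_general) + eps)
    return g_adapt - scale * g_general


# Usage: the conflicting component along the second axis is removed.
g = project_conflicting_gradient([1.0, -1.0], [0.0, 1.0])
print(g)
```

After projection the update is approximately orthogonal to the general-knowledge gradient, so to first order it neither helps nor harms that objective. Whether REST decouples gradients in this way is an assumption of this sketch.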
