An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

Jaeseok Byun; Seokhyeon Jeong; Wonjae Kim; Sanghyuk Chun; Taesup Moon

TL;DR

This work identifies a fundamental task discrepancy in projection-based Composed Image Retrieval (CIR): the text encoder is pretrained CLIP-style for image-text alignment, whereas CIR requires composing an image with text. It introduces RTD, a text-only post-hoc framework that applies target-anchored text contrastive learning to cheaply generated text triplets (T_r, T_c, T_t) to update the text encoder while keeping the vision backbone and projection module fixed. RTD also incorporates a refined concatenation scheme and hard-negative batch sampling to narrow the gap between training and inference. It yields consistent gains across multiple CIR benchmarks and backbones while requiring significantly less training time than synthetic CIR triplet methods, and it can be integrated with existing projection-based CIR systems across backbone sizes.

Abstract

Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image search. Mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation of these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image), which can degrade CIR performance. To reduce this discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-anchored text contrastive learning scheme designed to enhance the capability of the text encoder for CIR. We also propose two key enhancements: (1) a hard negative-based refined batch sampling strategy and (2) a refined concatenation scheme to further mitigate the training-inference discrepancy. Integrating RTD into state-of-the-art projection-based methods achieves performance comparable to, or even surpassing, resource-intensive state-of-the-art synthetic CIR triplet-based approaches with only 23 minutes of additional training on 4 A100 GPUs (up to $100\times$ faster in training). Our code will be available upon acceptance.
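To make the idea of a text-only contrastive objective more concrete, here is a minimal sketch. The abstract does not state the loss, so everything specific below is an assumption for illustration: the in-batch InfoNCE form, the temperature of 0.07, the function names, and the reading that `queries` stand for embeddings of a reference caption combined with conditioning text (T_r with T_c) while `targets` stand for embeddings of the target captions (T_t) used as anchors. It is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.07):
    """In-batch contrastive loss: queries[i] should match targets[i].

    Hypothetical stand-in for a target-anchored text contrastive loss;
    the other targets in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching target as the positive.
        total += log_denom - logits[i]
    return total / len(q)


# Toy 2-D embeddings (made up for illustration only).
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query points at its own target
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query points at the wrong target

loss_aligned = info_nce(aligned, targets)
loss_swapped = info_nce(swapped, targets)
```

In this toy setup the loss is lower when each composed-text embedding lies close to its own target embedding than when the pairs are swapped, which is the behavior a contrastive objective of this kind rewards.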

Paper Structure (32 sections, 1 equation, 6 figures, 21 tables)

Introduction
Related Work
Composed Image Retrieval.
Task discrepancy between the CLIP pre-training task and CIR.
Main Method
Obtaining text triplets
Target-anchored text contrastive learning
Experiments
Experimental setup
Main results
Ablation studies
Analyses on our core motivation
Compatibility across backbone sizes
Impact of the text triplet generation strategies
Conclusion
...and 17 more sections

Figures (6)

Figure 1: The task discrepancy of projection-based ZS-CIR methods between the pre-training task (image-text alignment) and the ZS-CIR task (image-text composition).
Figure 2: Overview of RTD.
Figure 3: Impact of backbone size. The results of RTD combined with Pic2Word, SEARLE, and LinCIR across different CLIP backbones (ViT-B/32 and ViT-L/14) are shown. Here, the score is the same metric as "Avg" in \ref{tab:ablation_method}, and other details are the same as in \ref{tab:ablation_method}. Full results are in \ref{subsec:appendix_different_backbones}.
Figure 4: Example of rule-based triplet datasets
Figure 5: Example of LLM-based triplet datasets
...and 1 more figure
