Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval

Young Kyun Jang; Dat Huynh; Ashish Shah; Wen-Kai Chen; Ser-Nam Lim

TL;DR

The paper addresses the scalability of Composed Image Retrieval (CIR) with a zero-shot framework that interpolates image and text embeddings directly using Spherical Linear Interpolation (Slerp), instead of projecting images to pseudo-word tokens. It also introduces Text-Anchored-Tuning (TAT), which freezes the text encoder and fine-tunes the image encoder with LoRA so that image embeddings align with the fixed text embeddings, reducing the modality gap. Together, Slerp and TAT yield state-of-the-art zero-shot CIR performance on the CIRR, CIRCO, and FashionIQ benchmarks while requiring little training data and compute (a single epoch, with fewer than 0.5% of parameters trainable). TAT-trained models also serve as strong initial checkpoints for supervised CIR, so the approach is useful in both zero-shot and supervised settings.
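To make the "few trainable parameters" point concrete, here is a minimal NumPy sketch of the generic LoRA mechanism: a frozen weight matrix plus a trainable low-rank update. It is not the paper's TAT implementation. The layer width `d`, rank `r`, `scaling` value, and the function name `lora_linear` are hypothetical, and this page does not state which layers receive adapters or what training objective is used.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical layer width and LoRA rank (not from the paper)

# Frozen pretrained weight: never updated during fine-tuning.
W_frozen = rng.normal(size=(d, d))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially reproduces the frozen layer exactly.
A = rng.normal(scale=0.01, size=(r, d))
B = np.zeros((d, r))
scaling = 1.0  # hypothetical scale applied to the low-rank update


def lora_linear(x):
    """Compute x @ (W_frozen + scaling * B @ A).T without forming the sum."""
    return x @ W_frozen.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(2, d))
trainable = A.size + B.size
total = W_frozen.size + trainable
print(f"trainable fraction for this toy layer: {trainable / total:.3f}")
```

The fraction printed here describes only this toy layer; the share of trainable parameters in a full encoder depends on how many layers are adapted and on the chosen rank.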

Abstract

Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
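The Slerp step described above can be sketched as follows. This is a minimal NumPy illustration of the standard Slerp formula, c = sin((1 - alpha) * theta) / sin(theta) * v + sin(alpha * theta) / sin(theta) * w, where theta is the angle between the unit vectors v and w. The function names, the random stand-in embeddings, and the convention that alpha = 0 returns the image embedding are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np


def slerp(v, w, alpha=0.5, eps=1e-7):
    """Spherical linear interpolation between two embeddings.

    Both inputs are L2-normalized first. alpha=0 returns the (normalized)
    image embedding v, alpha=1 the (normalized) text embedding w.
    """
    v = v / np.linalg.norm(v)
    w = w / np.linalg.norm(w)
    theta = np.arccos(np.clip(np.dot(v, w), -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < eps:
        # Vectors are (anti)parallel: the great-circle path is degenerate,
        # so this sketch simply falls back to v.
        return v
    return (np.sin((1.0 - alpha) * theta) / sin_theta) * v + (
        np.sin(alpha * theta) / sin_theta
    ) * w


def retrieve(query, gallery, k=3):
    """Return indices of the k gallery rows with highest cosine similarity."""
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query
    return np.argsort(-scores)[:k]


# Random vectors stand in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=8)     # v: reference image embedding
text_emb = rng.normal(size=8)      # w: modification caption embedding
gallery = rng.normal(size=(5, 8))  # candidate image embeddings

composed = slerp(image_emb, text_emb, alpha=0.5)  # c: composed query
top = retrieve(composed, gallery, k=3)
```

Unlike a plain weighted average, the interpolated vector stays on the unit sphere, so it can be compared to gallery embeddings by cosine similarity without renormalizing.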

Paper Structure (16 sections, 4 equations, 7 figures, 7 tables)

Introduction
Related Work
Supervised Composed Image Retrieval
Zero-shot Composed Image Retrieval
Method
Preliminaries
Spherical Linear Interpolation-based Retrieval
Text-Anchored-Tuning
Inference
Experiments
Settings
Main Results
Further Analysis
Discussion
Conclusion
...and 1 more section

Figures (7)

Figure 1: Overview of ZS-CIR approaches. The previous works Pic2Word, PALAVRA, and SEARLE utilize a projection module, which transforms an image into a textual pseudo-word. This is then combined with text (textual intent) to produce a composed embedding with a text encoder for retrieval purposes. In contrast, we propose a method based on a simple spherical linear interpolation. This method directly combines image (v) and text (w) embeddings to produce a composed embedding (c). We then use c to perform ZS-CIR.
Figure 2: Workflow of Text-Anchored-Tuning.
Figure 3: mAP and Recall scores by varying the $\alpha$ of Slerp with CLIP-ViT-L/14 model on CIRCO test set.
Figure 4: Qualitative results on CIRCO validation set. Green box denotes ground truth.
Figure 5: Retrieval results on CIRR test set.
...and 2 more figures
