Language-only Efficient Training of Zero-shot Composed Image Retrieval

Geonmo Gu; Sanghyuk Chun; Wonjae Kim; Yoohoon Kang; Sangdoo Yun

TL;DR

This work proposes a novel CIR framework that uses only language for its training and shows the best ZS-CIR performance on four CIR benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ.

Abstract

The composed image retrieval (CIR) task takes a composed query of an image and text and aims to retrieve images relevant to both conditions. Conventional CIR approaches need a training dataset of triplets (query image, query text, target image), which is very expensive to collect. Several recent works have adopted the zero-shot (ZS) CIR paradigm to tackle this issue without using pre-collected triplets. However, existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity in the input texts during training. We propose a novel CIR framework that uses only language for its training. Our LinCIR (Language-only training for CIR) can be trained with text datasets alone through a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text with the projected embedding. Then, we train the new and original texts to have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with a CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performance on four CIR benchmarks (CIRCO, GeneCIS, FashionIQ, and CIRR), even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir
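The self-masking projection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the mean-pooling `toy_text_encoder`, the linear `project` function, the 4-dimensional embeddings, and the optional Gaussian noise are all assumptions made for illustration, and the names are hypothetical. The sketch only shows the shape of the objective: encode the caption, project the latent into the token embedding space, swap it in for the keyword tokens, re-encode, and compare the two latents with an MSE loss.

```python
import random

DIM = 4  # toy embedding width; an assumption, real CLIP widths are far larger


def toy_text_encoder(token_embeddings):
    """Stand-in for the frozen text encoder: mean-pools the token embeddings."""
    count = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(DIM)]


def project(latent, weight):
    """Hypothetical linear phi: maps a latent vector into the token embedding space."""
    return [sum(weight[r][c] * latent[c] for c in range(DIM)) for r in range(DIM)]


def smp_loss(token_embeddings, keyword_positions, weight, noise_std=0.0, rng=None):
    """MSE between the original latent z_t and the latent of the self-masked text."""
    rng = rng or random.Random(0)
    z_t = toy_text_encoder(token_embeddings)
    # Optional noise before projection; the Gaussian form is an assumption here.
    if noise_std > 0:
        z_in = [v + rng.gauss(0.0, noise_std) for v in z_t]
    else:
        z_in = list(z_t)
    pseudo_token = project(z_in, weight)
    # Replace every keyword token embedding with the projected embedding.
    masked = [pseudo_token if i in keyword_positions else tok
              for i, tok in enumerate(token_embeddings)]
    z_hat = toy_text_encoder(masked)
    return sum((a - b) ** 2 for a, b in zip(z_t, z_hat)) / DIM


# Usage: a two-token caption whose first token is treated as the keyword.
identity = [[1.0 if r == c else 0.0 for c in range(DIM)] for r in range(DIM)]
caption = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
loss = smp_loss(caption, keyword_positions={0}, weight=identity)
```

In an actual training loop, only the projection weights would be updated to minimize this loss while the text encoder stays frozen; the gradient step is omitted here to keep the sketch dependency-free.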

Paper Structure (33 sections, 8 figures, 15 tables)

Introduction
Preliminaries
Vision-language models (VLM).
VLM modality gap.
CIR by projection to token embeddings.
Language-only Training of Zero-shot CIR
Self-Masking Projection (SMP)
Searching for a better noise distribution for reducing the modality gap.
Efficiency and scalability
Experiments
Implementation details
Experimental protocols
Evaluation benchmarks and metrics.
Comparison methods.
Main results
...and 18 more sections

Figures (8)

Figure 1: Training time (hours) vs. zero-shot composed image retrieval (ZS-CIR) performance. Thanks to its efficient language-only training strategy, LinCIR outperforms previous ZS-CIR methods in both training time and CIR performance. Training time is measured on 8 A100 GPUs. We compare the models on the CIRCO mAP@5 score [searle] for a more comprehensive evaluation of CIR models (more results are in Figure 4). Notably, when we scale up the CLIP [clip, openclip] backbone from ViT-L to ViT-H and ViT-G, LinCIR shows a promising performance boost with a surprisingly short training time (48 minutes for ViT-G). In contrast, Pic2Word [pic2word] and SEARLE [searle] cannot be scaled up to CLIP ViT-G because of their restricted textual expressions and the lack of diversity in their input texts.
Figure 2: Overview of ZS-CIR with a projection to the token embedding space. The mainstream ZS-CIR methods, such as Pic2Word [pic2word], SEARLE [searle], and LinCIR (ours), train a projection module $\phi$ that projects the image latent embedding $z_i$ into the token embedding space $e_c$, which is then placed in a custom prompt (e.g., "a photo of [$] that [cond]"). The textual encoder output is used for CIR.
Figure 3: Comparison of Pic2Word [pic2word] and LinCIR training procedures. (a) The Pic2Word [pic2word] and SEARLE [searle] training procedures require both the visual encoder and the textual encoder. They need only images for training, while the text prompt is pre-defined [pic2word] or automatically generated [searle]. (b) LinCIR is trained solely on texts with the frozen textual encoder. First, a projection module $\phi$ projects the textual latent embedding $z_t$ of a sentence into the token embedding space. Before the projection, random noise $n$ is added to $z_t$ to reduce the modality gap between text and image. We introduce a new self-supervision, named Self-Masking Projection (SMP), which replaces all keywords of the given caption with the embedding projected by $\phi$ and extracts a modified text embedding $\widehat{z}_t$. Finally, the projection module $\phi$ is trained with the MSE loss between $z_t$ and $\widehat{z}_t$. Note that both (a) and (b) use the same inference strategy shown in Figure 2.
Figure 4: Training time vs. CIR performance. We evaluate three CIR methods with three backbone sizes: ViT-L, ViT-H, and ViT-G. To avoid an unreliable assessment due to the nature of R@1, CIR performance is measured in CIRCO mAP@5 [searle], GeneCIS average R@3 [genecis], FashionIQ average R@50 [fashioniq], and CIRR average R@10 [cirr]. In all evaluation results, LinCIR achieves the best trade-off between training time and performance. Moreover, Pic2Word and SEARLE show degraded performance when the backbone size is scaled up.
Figure A.1: CIR Dataset examples. In all examples, the first image is the reference, and the right image is the target image with the given caption. For CIRCO, the left image is the query image, and the other four images are all ground truth images with the given text query.
...and 3 more figures
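
The inference strategy described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: `mean_pool` stands in for the frozen textual encoder, `phi` is any callable mapping an image latent to a pseudo-token embedding, and `compose_query` and `rank_gallery` are hypothetical helper names. The prompt "a photo of [$] that [cond]" is modeled as prefix tokens, the pseudo-token in place of [$], and the condition tokens.

```python
import math


def mean_pool(token_embeddings):
    """Stand-in for the frozen textual encoder (assumption: mean pooling)."""
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compose_query(image_latent, prefix_tokens, cond_tokens, phi, text_encoder=mean_pool):
    """Build the composed query: prefix + projected pseudo-token + condition tokens."""
    pseudo_token = phi(image_latent)  # fills the [$] slot of the prompt
    return text_encoder(prefix_tokens + [pseudo_token] + cond_tokens)


def rank_gallery(query_embedding, gallery):
    """Return gallery keys sorted by descending cosine similarity to the query."""
    return sorted(gallery,
                  key=lambda name: cosine(query_embedding, gallery[name]),
                  reverse=True)


# Usage with 2-dimensional toy embeddings and an identity projection.
query = compose_query([1.0, 0.0], [], [[0.0, 1.0]], phi=lambda z: z)
ranking = rank_gallery(query, {"match": [1.0, 1.0], "other": [1.0, -1.0]})
```

In a real system the gallery embeddings would come from the visual encoder and the query from the textual encoder of the same vision-language model, so that both live in the shared latent space.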
