Vision-by-Language for Training-Free Compositional Image Retrieval

Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata

TL;DR

This work tackles zero-shot compositional image retrieval (CIR) without task-specific training by introducing CIReVL, a training-free pipeline that reframes CIR as language-based reasoning. It captions the query image with an off-the-shelf VLM, uses a large language model to generate a target caption conditioned on the textual modification, and performs CLIP-based retrieval with the generated caption. CIReVL achieves competitive or state-of-the-art results across CIRCO, CIRR, FashionIQ, and GeneCIS benchmarks, and its modular design enables straightforward scaling by swapping in stronger retrieval models. Moreover, the language-centered approach offers human interpretability and the possibility of post-hoc interventions to correct failures, broadening practical applicability and facilitating analysis of bottlenecks and scaling behavior.

Abstract

Given an image and a target modification (e.g., an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on costly annotation of triplets (i.e., query image, textual modification, and target image), recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive and in part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double the previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing failure cases to be re-aligned post hoc. Code will be released upon acceptance.
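
The three stages described in the abstract (caption, recompose, retrieve) can be sketched as a composition of frozen, swappable components. The following is a minimal illustration, not the authors' implementation: the class name `TrainingFreeCIR`, the helper names, and the toy stand-ins (a lookup-table captioner, a hard-coded "LLM" rewrite, and a bag-of-words embedding in place of CLIP) are all assumptions made so the example runs without any model downloads.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


@dataclass
class TrainingFreeCIR:
    """Hypothetical wrapper: three frozen components, nothing is trained."""
    caption_image: Callable[[str], str]   # stand-in for a generative VLM captioner
    recompose: Callable[[str, str], str]  # stand-in for an LLM: (caption, modification) -> target caption
    embed_text: Callable[[str], Vector]   # stand-in for a CLIP-style text encoder

    def retrieve(self, image_id: str, modification: str,
                 gallery: Dict[str, Vector], k: int = 1) -> Tuple[List[str], str, str]:
        caption = self.caption_image(image_id)           # 1. image -> caption
        target = self.recompose(caption, modification)   # 2. caption + modification -> target caption
        query = self.embed_text(target)                  # 3. target caption -> embedding
        ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
        # The intermediate texts are returned so each stage can be inspected.
        return ranked[:k], caption, target


# --- Toy stand-ins (assumptions for illustration only) ---
VOCAB = ["eiffel", "tower", "people", "day", "night", "beach"]


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a tiny vocabulary, in place of a real encoder."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


toy_captions = {"eiffel_query": "the eiffel tower with people during the day"}

pipeline = TrainingFreeCIR(
    caption_image=lambda image_id: toy_captions[image_id],
    # A real LLM would rewrite the caption; this stub hard-codes one plausible rewrite.
    recompose=lambda caption, modification: "the eiffel tower at night",
    embed_text=toy_embed,
)

# In practice the gallery would hold image embeddings from the retrieval model;
# here each image is represented by the embedding of a short description.
gallery = {
    "eiffel_day_crowd": toy_embed("the eiffel tower with people during the day"),
    "eiffel_night": toy_embed("the eiffel tower at night"),
    "beach_night": toy_embed("a beach at night"),
}

top, caption, target = pipeline.retrieve(
    "eiffel_query", "without people and at night-time", gallery, k=2
)
print("caption:", caption)
print("target :", target)
print("top-2  :", top)
```

Because each component is just a callable, swapping in a stronger retrieval model amounts to passing a different `embed_text` (and re-embedding the gallery), which mirrors the scaling-without-retraining property the abstract describes.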

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Compositional Image Retrieval through Vision-by-Language (CIReVL). Given an input image and a text modifier, we use an off-the-shelf vision-language model to caption the image. An LLM processes the generated caption and the text modifier to generate a description of the desired target image. To obtain the final image, we use a vision-language model and perform text-to-image retrieval. CIReVL is modular, training-free and human-understandable in natural language.
  • Figure 2: Examples from the CIRCO validation set where our method retrieves the desired image. We see that our method is able to perform this task for a wide variety of modifier texts.
  • Figure 3: We analyze failure cases of our approach. Because the pipeline is interpretable, we can easily attribute errors to captioning, reasoning, or text-image retrieval. We also find examples where the model is penalized despite a plausible retrieval due to insufficient annotations in the dataset.
  • Figure 4: We demonstrate the possibility of user interventions to enhance the performance of our method. For instance, by fixing the mistakes in the generated caption, we are able to correctly retrieve the desired image without having to make any other changes.
  • Figure 5: We use TIFA to compare (i) how well the LLM description and the base caption each align with the target image, and (ii) how well the LLM description aligns with the target image versus the CLIP-retrieved image. The comparison shows the impact of LLM reasoning over base captions and identifies CLIP retrieval as an essential bottleneck.
  • ...and 1 more figure
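
The error attribution and user intervention described in Figures 3 and 4 follow from the fact that every intermediate result is plain text. The sketch below is a hedged illustration under toy assumptions, not the paper's code: the function names, the `caption_override` parameter, the rule-based "LLM" stub, and the use of word overlap between text descriptions in place of CLIP text-to-image similarity are all hypothetical.

```python
from typing import Callable, Dict, Optional


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for CLIP similarity."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def retrieve_with_trace(image_id: str, modification: str,
                        captioner: Callable[[str], str],
                        recomposer: Callable[[str, str], str],
                        gallery: Dict[str, str],
                        caption_override: Optional[str] = None) -> Dict[str, str]:
    """Run the stages and return every intermediate text for inspection.

    If `caption_override` is given, it replaces the captioner's output;
    the remaining stages run unchanged.
    """
    caption = caption_override if caption_override is not None else captioner(image_id)
    target = recomposer(caption, modification)
    retrieved = max(gallery, key=lambda name: word_overlap(target, gallery[name]))
    return {"caption": caption, "target": target, "retrieved": retrieved}


# Toy setup: the query image shows a cat, but the captioner stub gets it wrong.
faulty_captioner = lambda image_id: "a dog on a sofa"
# Rule-based stand-in for the LLM, valid only for this one modification.
toy_recomposer = lambda caption, modification: caption.replace("sofa", "beach")
# Gallery images are represented by text descriptions purely for illustration.
gallery = {
    "cat_beach": "a cat on a beach",
    "dog_beach": "a dog on a beach",
    "cat_sofa": "a cat on a sofa",
}

before = retrieve_with_trace("query", "on a beach instead",
                             faulty_captioner, toy_recomposer, gallery)
after = retrieve_with_trace("query", "on a beach instead",
                            faulty_captioner, toy_recomposer, gallery,
                            caption_override="a cat on a sofa")
print(before)  # the trace shows the error originates in the caption
print(after)   # correcting only the caption changes the retrieved image
```

Reading the `before` trace localizes the failure to the captioning stage, and the `after` call corrects it without touching the reasoning or retrieval components.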