Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking

Shitong Sun, Fanghua Ye, Shaogang Gong

TL;DR

The paper addresses the high cost and opacity of learning-based zero-shot composed image retrieval by introducing a training-free framework that converts a composed image-text query into explicit text. It combines a Global Retrieval Baseline (GRB), which generates a pseudo target caption and searches in a BLIP2-aligned text-image space, with Local Concept Re-Ranking (LCR), which identifies discriminative local concepts using an LLM and a vision-language model and then fuses the global and local scores. This two-stage approach performs competitively with state-of-the-art triplet-trained methods and substantially improves over other training-free baselines, especially on open-domain datasets, while providing interpretable local attributes. By avoiding task-specific training, the method reduces computational overhead, and its prompt-driven, explicit concept reasoning is intended to improve generalization in CIR.

Abstract

Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text. It has recently attracted attention due to the collaboration of information-rich images and concise language to precisely express the requirements of target images. Most current composed image retrieval methods follow a supervised learning approach, training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image. To avoid difficult-to-obtain labeled triplet training data, zero-shot composed image retrieval (ZS-CIR) has been introduced, which aims to retrieve the target image by learning from image-text pairs (self-supervised triplets), without the need for human-labeled triplets. However, this self-supervised triplet learning approach is computationally less effective and less understandable, as it assumes the interaction between image and text is conducted through an implicit query embedding without explicit semantic interpretation. In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text. This helps improve model learning efficiency to enhance the generalization capacity of foundation models. Further, we introduce a Local Concept Re-ranking (LCR) mechanism to focus on discriminative local information extracted from the modified instructions. Extensive experiments on four ZS-CIR benchmarks show that our method achieves performance comparable to that of the state-of-the-art triplet-training-based methods, and significantly outperforms other training-free methods on the open-domain datasets (CIRR, CIRCO and COCO), as well as the fashion-domain dataset (FashionIQ).

Paper Structure (12 sections, 7 equations, 5 figures, 5 tables, 1 algorithm)


Figures (5)

  • Figure 1: A comparison of query processing between existing zero-shot composed image retrieval methods and our method. Top: Existing methods are fine-tuned on text-image pairs [saito2023pic2word, baldrati2023zero] or task-specific triplets [levy2023data, liu2023zero]. They employ a fusion embedding for queries lacking explicit semantic context. Bottom: We propose a novel method without the requirement of training. Our method processes queries with explicit semantics at both global and local levels.
  • Figure 2: An overview of the Training-Free Composed Image Retrieval model. Top: The Global Retrieval Baseline (GRB) transforms the text-image composed query into a text-only query with a Large Language Model (LLM)-generated pseudo target caption. The global score for retrieval is determined by the similarity between the pseudo target caption's text embedding and the target image embeddings. Bottom: Local Concept Re-Ranking (LCR) extracts discriminative local concepts using a task instruction prompt for the LLM. Starting from the GRB ranking, the top-$K$ images are re-ranked using both the global score and a local score calculated on their discriminative local concepts. The local score is derived from the text prediction probability with which a vision-language model, i.e. LLaVA, detects the presence of the local concepts. Both the global and local retrieval processes are conducted within a vision-language aligned feature space, i.e. BLIP2.
  • Figure 3: Qualitative results on the CIRR test set. GRB is the global retrieval baseline, which merges holistic information from the reference image caption and the modified text. LCR is local concept re-ranking, which extracts discriminative local information to re-rank the results of the global retrieval baseline.
  • Figure 4: Qualitative results on COCO for object composition.
  • Figure 5: Ablation study on CIRR test set.
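
The two-stage flow described in the Figure 2 caption (global retrieval, then re-ranking of the top-$K$ images) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings and local-concept probabilities are toy numbers standing in for BLIP2 features and LLaVA predictions, the function names are hypothetical, and the weighted-sum fusion with a `weight` parameter is an assumed rule, since the text here only says that global and local scores are both used.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def global_scores(caption_emb, gallery_embs):
    """Cosine similarity between the pseudo target caption embedding
    and every gallery image embedding (stand-ins for BLIP2 features)."""
    return l2_normalize(gallery_embs) @ l2_normalize(caption_emb)

def rerank_top_k(g_scores, local_probs, k, weight=0.5):
    """Re-rank only the top-k images from the global ranking.

    ASSUMPTION: fusion is a weighted sum of global and local scores;
    the actual fusion rule is not specified in this section.
    """
    order = np.argsort(-g_scores, kind="stable")   # global ranking, best first
    top_k, rest = order[:k], order[k:]
    fused = (1.0 - weight) * g_scores[top_k] + weight * local_probs[top_k]
    reranked = top_k[np.argsort(-fused, kind="stable")]
    return np.concatenate([reranked, rest])        # images outside top-k keep their order

# Toy data (hypothetical): one caption embedding and four gallery images.
caption_emb = np.array([1.0, 0.0])
gallery_embs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])

# Hypothetical probabilities that each image contains the local concept,
# standing in for the vision-language model's text prediction probability.
local_probs = np.array([0.1, 0.9, 0.5, 0.5])

g = global_scores(caption_emb, gallery_embs)
ranking = rerank_top_k(g, local_probs, k=2, weight=0.5)
print("global scores:", np.round(g, 3))
print("final ranking:", ranking.tolist())
```

In this toy run, image 0 is the best global match, but image 1 overtakes it after re-ranking because the local concept is far more likely to be present in it; images outside the top-$K$ are left in their global order.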