Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval

Zelong Sun, Dong Jing, Guoxing Yang, Nanyi Fei, Zhiwu Lu

TL;DR

This paper tackles Composed Image Retrieval by introducing CIR-LVLM, which uses a large vision-language model as a user intent-aware encoder to jointly process a reference image and a relative caption. A novel hybrid intent instruction module provides two levels of guidance: a task-level prompt and an instance-specific soft prompt drawn from a learnable prompt pool, enabling adaptive, instance-aware reasoning. The approach leverages a Connector to map visual content into sentence-level prompts and optimizes representations with a contrastive objective, achieving state-of-the-art performance on Fashion-IQ, Shoes, and CIRR benchmarks while maintaining single-pass, efficient inference. The work demonstrates that LVLMs can surpass traditional Vision-Language Models in multimodal CIR tasks and offers actionable insights into prompt design and interpretability for reasoning-driven retrieval.
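To make the contrastive objective concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss between query embeddings and target-image embeddings. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `info_nce`, the cosine similarity, and the temperature value of 0.07 are all hypothetical choices.

```python
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """In-batch contrastive loss: row i of query_emb should match row i of target_emb.

    query_emb, target_emb: arrays of shape (B, d). The temperature is an assumed value.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature                     # (B, B) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal; all other targets in the batch act as negatives.
    return -np.mean(np.diag(log_prob))

# Toy check: correctly paired embeddings should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 8))
aligned_loss = info_nce(targets, targets)
shuffled_loss = info_nce(targets, targets[::-1])
```

In this toy setup the aligned pairs give a loss close to zero, while reversing the target order pushes every positive off the diagonal and raises the loss.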

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLMs for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent instruction module to provide explicit intent guidance at two levels: (1) The task prompt clarifies the task requirement and assists the model in discerning user intent at the task level. (2) The instance-specific soft prompt, which is adaptively selected from a learnable prompt pool, enables the model to comprehend the user intent at the instance level better than a universal prompt shared by all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks with acceptable inference efficiency. We believe this study provides fundamental insights into CIR-related fields.
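The idea of adaptively selecting an instance-specific soft prompt from a learnable pool can be sketched as follows. This is only an illustration: the abstract does not specify the selection rule, so the key-query cosine matching, the `top_k` parameter, the function name `select_soft_prompts`, and all tensor shapes below are assumptions.

```python
import numpy as np

def select_soft_prompts(query_feat, pool_keys, pool_prompts, top_k=2):
    """Pick the top_k pool prompts whose keys best match one instance's feature.

    query_feat:   (d,)       hypothetical per-instance feature (e.g. from image and text)
    pool_keys:    (P, d)     one key per pool entry (assumed matching mechanism)
    pool_prompts: (P, L, h)  P soft prompts, each with L tokens of width h
    """
    q = query_feat / np.linalg.norm(query_feat)
    k = pool_keys / np.linalg.norm(pool_keys, axis=1, keepdims=True)
    scores = k @ q                      # cosine similarity to every key, shape (P,)
    idx = np.argsort(-scores)[:top_k]   # indices of the best-matching entries
    selected = pool_prompts[idx]        # (top_k, L, h)
    # Concatenate the chosen prompts into one token sequence for the encoder input.
    return idx, selected.reshape(-1, pool_prompts.shape[-1])

# Toy usage: a feature built to lie close to key 3 should select entry 3 first.
rng = np.random.default_rng(0)
pool_keys = rng.normal(size=(8, 16))
pool_prompts = rng.normal(size=(8, 4, 32))
query_feat = pool_keys[3] + 0.01 * rng.normal(size=16)
idx, soft_prompt = select_soft_prompts(query_feat, pool_keys, pool_prompts)
```

Because the selection depends on the instance feature, two queries with different content would generally receive different soft prompts, in contrast to a single universal prompt shared by all instances.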

Paper Structure

This paper contains 30 sections, 7 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Workflows of existing CIR methods and our proposed CIR-LVLM: (a) Early-fusion, (b) Textual inversion, and (c) Our proposed CIR-LVLM. It can be seen that the first two fusion strategies fail to discern the user intent conveyed by the relative caption: (a) fails to retain the species of Corgi, and (b) fails to move the dog outdoors. Our fusion strategy leverages the superior user intent-aware capability of LVLM and successfully recalls the target image.
  • Figure 2: (a) Overview of the architecture of our proposed model. All parameters are shared between the query and the target image. The intent instructions are used to form the inputs of the LLM. The details of the intent instructions can be found in Fig. 3. (b) Details of the prompt pool. We select prompts according to both visual features and text embeddings.
  • Figure 3: Illustration of intent instructions for the hybrid-modality query and the target image. An intent instruction consists of three components: (1) Task Input, (2) Task Prompt, and (3) Instance-Specific Soft Prompt.
  • Figure 4: Influence of (a) length of soft prompt and (b) length of prompt pool.
  • Figure 5: Attention map visualization (right side of the first and third rows). The sum of the attention weights over all the visual or relative caption tokens for the soft prompt and hard prompt (second and third rows).
  • ...and 4 more figures