Compositional Image Retrieval via Instruction-Aware Contrastive Learning

Wenliang Zhong, Weizhi An, Feng Jiang, Hehuan Ma, Yuzhi Guo, Junzhou Huang

TL;DR

This work addresses zero-shot Composed Image Retrieval (ZS-CIR) by reframing CIR as an instruction-following task and leveraging instruction-tuned Multimodal LLMs (MLLMs) to produce instruction-aware multimodal embeddings. A two-stage training pipeline first aligns images and text in a joint embedding space using image-caption data, then fine-tunes on a CIR-like triplet dataset (source image, modification instruction, target caption) to strengthen instruction following. The model, InstructCIR, demonstrates substantial improvements over state-of-the-art baselines across FashionIQ, CIRR, CIRCO, and GeneCIS by effectively integrating visual and instructional information and generalizing to unseen CIR scenarios. The approach offers practical potential for scalable, instruction-aware CIR in diverse domains and devices, supported by ablations and sensitivity analyses that underscore the benefits of staged training, diverse data, and adaptable MLLM mechanisms.

Abstract

Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query: an image paired with text that specifies modifications to the visual reference. CIR is inherently an instruction-following task, as the model needs to interpret and apply modifications to the image. In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable. While existing ZS-CIR models based on CLIP have shown promising results, their capability in interpreting and following modification instructions remains limited. Some research attempts to address this by incorporating Large Language Models (LLMs). However, these approaches still face challenges in effectively integrating multimodal information and instruction understanding. To tackle the above challenges, we propose a novel embedding method that uses an instruction-tuned Multimodal LLM (MLLM) to generate composed representations, which significantly enhances instruction-following capability and integrates images and instructions more comprehensively. Nevertheless, directly applying MLLMs introduces a new challenge, since MLLMs are primarily designed for text generation rather than the embedding extraction required in CIR. To address this, we introduce a two-stage training strategy that efficiently learns a joint multimodal embedding space and then further refines the ability to follow modification instructions by tuning the model on a triplet dataset similar to the CIR format. Extensive experiments on four public datasets (FashionIQ, CIRR, GeneCIS, and CIRCO) demonstrate the superior performance of our model, which outperforms state-of-the-art baselines by a significant margin. Code is available at the GitHub repository.
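
The two-stage strategy described above can be illustrated with a small sketch. This summary does not state the exact loss, so the symmetric InfoNCE objective, the temperature of 0.07, and the names `l2_normalize` and `info_nce` are all assumptions made for illustration, not the authors' implementation. Random arrays stand in for MLLM embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(queries, targets, temperature=0.07):
    """Symmetric InfoNCE over a batch (assumed objective, not confirmed by the source).

    Row k of `queries` is the positive match for row k of `targets`;
    all other rows in the batch act as negatives.
    """
    q = l2_normalize(queries)
    t = l2_normalize(targets)
    logits = q @ t.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row, then pick the positives.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the query-to-target and target-to-query directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy stand-ins for embeddings (hypothetical data, 8 samples of dimension 16).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))       # stage 1: image embeddings
caption_emb = rng.normal(size=(8, 16))     # stage 1: caption embeddings
composed_emb = rng.normal(size=(8, 16))    # stage 2: image + instruction embeddings
target_cap_emb = rng.normal(size=(8, 16))  # stage 2: modified-caption embeddings

stage1_loss = info_nce(image_emb, caption_emb)        # align images with captions
stage2_loss = info_nce(composed_emb, target_cap_emb)  # align composed queries with target captions
```

Under this assumption the two stages share one loss function and differ only in what is paired: image with caption in stage 1, and composed (image, instruction) query with the modified caption in stage 2.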

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Comparison of Existing ZS-CIR Approaches vs. InstructCIR. Current state-of-the-art CIR methods typically rely on VLMs such as CLIP and are constrained by the limited instruction-following capabilities of CLIP models. In contrast, our approach employs instruction-tuned MLLMs specifically designed for instruction-following tasks, including CIR. As shown in the attention map derived from the composed embedding using yu2024attention, our approach is able to focus on specific parts of the image following the modification instruction. In the example, the front wheel and the floor are highlighted in line with the “on a track” and “front wheel in the air” parts of the modification.
  • Figure 2: The Two-Stage Training Strategy for InstructCIR. The diagram illustrates our two-stage approach. Stage 1: The model is trained on image-caption pairs $(i, c)$ to align multimodal embeddings. The image is encoded by the MLLM to $h_i$, while the caption is processed to generate $h_c$. This stage establishes a shared embedding space for both modalities. Stage 2: The model is fine-tuned with triplet data $(i, t, c_r)$. The image and modifier text are composed into an embedding $h_{it}$, while the modified caption is encoded as $h_{c_r}$. The objective is to align $h_{it}$ and $h_{c_r}$, enhancing instruction-following abilities. The visual module includes the visual encoder and adapter. The strategy effectively handles CIR tasks by integrating visual and textual information. Inference: During inference, the source image is encoded with the corresponding modification instruction to $h_{it}$. Target images are encoded to $h_{i_r}$, which can be pre-computed and cached. The CIR system leverages the composed embedding $h_{it}$ to find the matched target image embedding $h_{i_r}$.
  • Figure 3: Model Architecture. For composed inputs (images and texts), the image $i$ is processed by a visual encoder and adapter, while the instruction $t$ is tokenized. Both are concatenated and fed into the LLM along with the [EOS] token. The final output at the [EOS] token provides the unified embedding $h$. For text-only inputs, the visual encoder and adapter are bypassed. The causal attention in the LLM propagates information from previous tokens into the current token, so the image and instruction information is integrated into the [EOS] token, resulting in an instruction-aware composed embedding $h$.
  • Figure 4: Examples from CIRR (top) and CIRCO (bottom) validation sets. Results are ranked from the highest (left) to lowest (right) similarity. InstructCIR effectively retrieves images across a wide variety of modifier instructions from source images.
  • Figure 5: Effectiveness of the triplet data by scale. The baseline is our model trained on the whole original CC3M pair data. The plot shows the performance curve on the validation sets over training steps. Performance improves rapidly in the early steps.
  • ...and 3 more figures
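
The inference path in the Figure 2 and Figure 3 captions (take the output at the [EOS] token as the embedding, then match the composed embedding $h_{it}$ against cached target embeddings $h_{i_r}$) can be sketched as follows. This is a minimal illustration, not the authors' code: cosine similarity as the matching score, the assumption that [EOS] is the last position of an unpadded sequence, and the names `eos_embedding` and `rank_targets` are all hypothetical.

```python
import numpy as np

def eos_embedding(hidden_states):
    """Return the final-layer hidden state at the [EOS] position.

    Assumes `hidden_states` has shape (seq_len, dim) and that [EOS] is the
    last token of an unpadded sequence. With causal attention, this position
    can attend to every image and instruction token before it.
    """
    return hidden_states[-1]

def rank_targets(h_it, cached_targets, top_k=3):
    """Rank pre-computed target image embeddings against one composed query.

    Cosine similarity is an assumed scoring function; the source only says
    that results are ranked by similarity.
    """
    query = h_it / np.linalg.norm(h_it)
    gallery = cached_targets / np.linalg.norm(cached_targets, axis=1, keepdims=True)
    scores = gallery @ query
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return order, scores[order]

# Toy gallery of three cached target embeddings (hypothetical values).
cached_targets = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])

# Toy hidden states for a composed (image + instruction) input; the last row
# plays the role of the [EOS] output.
hidden_states = np.array([[0.2, 0.7],
                          [0.5, 0.5],
                          [0.9, 0.1]])

h_it = eos_embedding(hidden_states)
order, scores = rank_targets(h_it, cached_targets)
```

Because the gallery embeddings do not depend on the query, they can be computed once and reused, which is the caching the Figure 2 caption refers to; only the composed query needs a forward pass per request.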