MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval

Rong-Cheng Tu, Zhao Jin, Jingyi Liao, Xiao Luo, Yingjie Wang, Li Shen, Dacheng Tao

TL;DR

The paper tackles zero-shot composed image retrieval by moving beyond adapter-based pseudo-text strategies. It introduces MVFT-JI, which uses a pretrained multimodal large language model (MLLM) to synthesize two training tasks from unlabeled images and jointly fine-tunes a vision-language model (VLM) on them. During inference, MVFT-JI combines VLM-based compositional alignment with MLLM-generated target descriptions to improve retrieval accuracy. Empirical results on FashionIQ, CIRCO, and CIRR demonstrate state-of-the-art performance, with ablations validating the necessity of joint training, target-text supervision, and MLLM-guided data generation. The approach reduces annotation costs and enhances generalization, though the authors acknowledge risks from MLLM hallucinations and biases and suggest future mitigation strategies.

Abstract

Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, which limits retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to construct two complementary training tasks using only unlabeled images: a target text retrieval task and a text-to-image retrieval task. By jointly optimizing these tasks, our method enables the VLM to inherently acquire robust compositional retrieval capabilities, a claim supported by theoretical justification and empirical validation. During inference, we additionally prompt the MLLM to generate target texts from composed queries and compute retrieval scores by integrating similarities between (i) the composed query and candidate images, and (ii) the MLLM-generated target text and candidate images. This strategy effectively combines the VLM's semantic alignment strengths with the MLLM's reasoning capabilities.
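
The joint inference step described in the abstract can be illustrated with a minimal sketch. The abstract only states that the two similarities are integrated, so the weighted sum, the `alpha` parameter, the use of cosine similarity, and all function names below are assumptions made for illustration, not the paper's stated formulation. The embeddings are toy vectors standing in for VLM outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def joint_scores(query_emb, target_text_emb, image_embs, alpha=0.5):
    """Fuse two similarities per candidate image.

    (i)  composed query vs. candidate image
    (ii) MLLM-generated target text vs. candidate image
    The convex combination with weight `alpha` is an assumption of this sketch.
    """
    scores = []
    for img in image_embs:
        s_query = cosine(query_emb, img)
        s_text = cosine(target_text_emb, img)
        scores.append(alpha * s_query + (1.0 - alpha) * s_text)
    return scores


def rank_candidates(scores):
    """Return candidate indices sorted from highest to lowest fused score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings (hypothetical values, not taken from the paper).
query_emb = [1.0, 0.0]          # composed query: reference image + modification text
target_text_emb = [0.0, 1.0]    # embedding of the MLLM-generated target text
image_embs = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]

scores = joint_scores(query_emb, target_text_emb, image_embs, alpha=0.5)
ranking = rank_candidates(scores)
```

In this toy setup the second candidate ranks first because it agrees moderately with both signals, whereas the first candidate matches only the composed query. This is the kind of complementarity the joint scoring is meant to exploit.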

Paper Structure

This paper contains 35 sections, 19 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: The training and inference framework of MVFT-JI.
  • Figure 2: Prompt template $P_m$ for modification text generation.
  • Figure 3: Prompt template $P_{tt}$ for target text generation.
  • Figure 4: Prompt template $P_c$ for image caption generation.
  • Figure 5: Prompt template $P'_m$ for modification text generation in the ablation study.
  • ...and 5 more figures