Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval

Young Kyun Jang; Donghyun Kim; Zihang Meng; Dat Huynh; Ser-Nam Lim

TL;DR

This work tackles the limited scalability of Composed Image Retrieval (CIR) by introducing a semi-supervised framework in which a Visual Delta Generator (VDG) produces text describing the visual difference (the visual delta) between two images. The VDG is trained in two stages (alignment and instruction tuning) and applied to an auxiliary image gallery to generate pseudo triplets. CIR training then combines supervised and pseudo triplets under a joint contrastive objective and a target-delta matching loss. Empirically, the approach achieves state-of-the-art results on CIRR and FashionIQ while reducing reliance on costly human annotations; VDG-generated deltas also match human annotations in quality and can outperform them when the two are combined. The method offers a scalable, model-agnostic path to domain-specific CIR by leveraging large multimodal models and unlabeled data, with practical implications for deploying CIR in diverse visual domains.

Abstract

Composed Image Retrieval (CIR) is the task of retrieving images that are similar to a query image, based on a provided textual modification. Current techniques train CIR models with supervised learning on labeled triplets of (reference image, text, target image). Such triplets are far less common than simple image-text pairs, which limits the scalability and widespread use of CIR. Zero-shot CIR, by contrast, can be trained relatively easily on image-caption pairs without modeling the image-to-image relation, but it tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for a reference image and its related target images, and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., the visual delta) between the two. The VDG, equipped with fluent language knowledge and being model-agnostic, generates pseudo triplets that boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on CIR benchmarks.
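
The pseudo-triplet idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each auxiliary image is already represented by a feature vector, pairs every reference with its most similar images by cosine similarity, and calls a caller-supplied function to produce the delta text. The names `cosine_similarity`, `build_pseudo_triplets`, and `describe_delta` are hypothetical, and the lambda in the toy usage is only a placeholder for a trained VDG.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[int, str, int]  # (reference index, delta text, target index)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_pseudo_triplets(
    features: List[Vector],
    describe_delta: Callable[[int, int], str],
    k: int = 1,
) -> List[Triplet]:
    """Pair each image with its k most similar images and attach a delta text.

    `describe_delta(ref, tgt)` is a hypothetical stand-in for a visual delta
    generator; here it only receives the two image indices.
    """
    triplets: List[Triplet] = []
    for ref in range(len(features)):
        # Score every other image against the reference.
        scored = [
            (cosine_similarity(features[ref], features[j]), j)
            for j in range(len(features))
            if j != ref
        ]
        # Highest similarity first; ties broken by the lower index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        for _, tgt in scored[:k]:
            triplets.append((ref, describe_delta(ref, tgt), tgt))
    return triplets


# Toy usage: four 2-D "image features" forming two visually similar groups.
toy_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_triplets = build_pseudo_triplets(
    toy_features, lambda ref, tgt: f"delta({ref}->{tgt})", k=1
)
```

In a real pipeline the features would come from an image encoder and the delta text from the trained generator; the brute-force neighbor search here is chosen only for readability.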

Paper Structure (42 sections, 5 equations, 14 figures, 8 tables, 3 algorithms)

Introduction
Related Work
Composed Image Retrieval.
Semi-supervised Learning.
Multi-modal Models for Image-Text Retrieval.
Method
Overview.
Visual Delta Generator Training
Stage 1: Alignment.
Stage 2: Instruction Tuning.
Pseudo Triplet Generation for CIR
Semi-supervised CIR Training
Preliminaries.
Model Architecture.
Supervised / Pseudo Separated Contrastive Loss.
...and 27 more sections

Figures (14)

Figure 1: An illustration of the data preparation process of (a) conventional supervised Composed Image Retrieval (CIR) vs. (b) our proposed semi-supervised CIR. While supervised CIR struggles to scale up due to high annotation costs, our semi-supervised method offers a cost-effective and scalable solution. It augments training samples efficiently by generating pseudo triplets through our Large Language Model (LLM)-based Visual Delta Generator.
Figure 2: An overview of the VDG tuning process. The VDG includes (a) a vision projector and (b) a Large Language Model (LLM). It is trained to produce a visual delta that accurately describes the difference between a reference image and its corresponding target image.
Figure 3: A template prompt for VDG instruction tuning.
Figure 4: The process of pseudo triplet generation. First, an image subgroup is constructed based on visual similarity (left). Then, paired reference and target images are fed into the VDG to generate the visual delta, completing the triplet formation (right).
Figure 5: Illustration of our proposed adaptation of the BLIP image-grounded text encoder for CIR. Both reference ($x^r$) and target image ($x^t$) patch tokens are processed by the text encoder ($f_{\theta}$).
...and 9 more figures
