Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval
Young Kyun Jang, Donghyun Kim, Zihang Meng, Dat Huynh, Ser-Nam Lim
TL;DR
This work addresses the limited scalability of Composed Image Retrieval (CIR) with a semi-supervised framework built around a Visual Delta Generator (VDG), which produces text describing the visual difference (the visual delta) between two images. The VDG is trained in two stages (alignment and instruction tuning) and applied to image pairs drawn from an auxiliary image gallery to generate pseudo triplets. CIR models are then trained on both supervised and pseudo triplets with a joint contrastive objective and a target-delta matching loss. Empirically, the approach achieves state-of-the-art results on CIRR and FashionIQ while reducing reliance on costly human annotations; VDG-generated deltas are comparable in quality to human annotations, and training on the two combined can outperform training on human annotations alone. The method offers a scalable, model-agnostic path to domain-specific CIR by leveraging large multimodal models and unlabeled data, with practical implications for deploying CIR in diverse visual domains.
Abstract
Composed Image Retrieval (CIR) is the task of retrieving images that are similar to a query and reflect a provided textual modification. Current techniques train CIR models with supervised learning on labeled triplets of (reference image, text, target image). Such triplets are far less common than simple image-text pairs, which limits the scalability and widespread use of CIR. Zero-shot CIR, on the other hand, can be trained relatively easily on image-caption pairs without considering the image-to-image relation, but it tends to yield lower accuracy. We propose a new semi-supervised CIR approach in which we search auxiliary data for reference images and their related target images, and train our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e., the visual delta) between the two. Being model agnostic and equipped with fluent language knowledge, VDG can generate pseudo triplets that boost the performance of CIR models. Our approach significantly improves on existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks.
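The pseudo-triplet construction described above can be illustrated with a minimal sketch. The abstract does not say how reference and target images are searched in the auxiliary data, so everything below is an assumption for illustration: nearest-neighbor search by cosine similarity over precomputed image embeddings, the `low` and `high` thresholds, and the function names `mine_pseudo_pairs` and `build_pseudo_triplets`. The `delta_fn` argument is a hypothetical stand-in for the VDG, which in the paper is a large multimodal model.

```python
import numpy as np


def mine_pseudo_pairs(gallery_emb, low=0.5, high=0.95):
    """Pair each gallery image with its most similar neighbor.

    Assumption (not from the paper): a pair is kept only if its cosine
    similarity lies in [low, high), so that the two images are related
    but not near-duplicates.
    """
    normed = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)             # an image is never its own target
    sim = np.where(sim >= high, -np.inf, sim)  # drop near-duplicate candidates

    pairs = []
    for ref_idx in range(sim.shape[0]):
        tgt_idx = int(np.argmax(sim[ref_idx]))
        if sim[ref_idx, tgt_idx] >= low:
            pairs.append((ref_idx, tgt_idx))
    return pairs


def build_pseudo_triplets(gallery_emb, delta_fn, **thresholds):
    """Return (reference index, delta text, target index) triplets.

    `delta_fn(ref_idx, tgt_idx)` is a hypothetical placeholder for the VDG:
    it should return text describing how the target differs from the reference.
    """
    pairs = mine_pseudo_pairs(gallery_emb, **thresholds)
    return [(ref, delta_fn(ref, tgt), tgt) for ref, tgt in pairs]
```

In this sketch the resulting pseudo triplets have the same (reference, text, target) shape as labeled triplets, so they could be mixed into supervised CIR training as the abstract describes.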
