UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries

Hoang-Bao Le; Allie Tran; Binh T. Nguyen; Liting Zhou; Cathal Gurrin

TL;DR

This work addresses zero-shot image-guided retrieval by unifying Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR) under a single setting, Image-Guided Retrieval with Optional Text (IGROT). It introduces UNION, a lightweight target representation that fuses the target image embedding with a null-text prompt via a small Transformer-MLP stack. The target representation then lies in the same vision-language embedding space as the query, without any modification to the pretrained backbones, which improves semantic alignment across modalities. To support data-efficient learning, the authors construct LlavaSCo (5,000 refined LaSCo triplets) and Training-Sketchy (5,000 SBIR triplets). They report results on CIRCO, FashionIQ, CIRR, Sketchy, TU-Berlin, and QuickDraw, with CIRCO mAP@50 reaching 38.5 and Sketchy mAP@200 reaching 82.7. The results indicate that UNION, particularly when paired with strong backbones and caption-rich data, yields robust cross-modal retrieval under limited supervision, a practical benefit for scalable IGROT systems.

Abstract

Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting in which a query consists of an anchor image, with or without accompanying text, and the goal is to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples per task (from LlavaSCo for CIR and from Training-Sketchy for SBIR), our method achieves competitive results across benchmarks, including a CIRCO mAP@50 of 38.5 and a Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.
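The headline numbers above are mAP@k scores. As a reading aid, the following is a minimal sketch of how mAP@k is commonly computed for ranked retrieval. It assumes the normalisation by min(k, number of relevant targets) that is typical for multi-target benchmarks; whether each benchmark in this paper uses exactly this convention is an assumption, and the function names are hypothetical rather than taken from the authors' code.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked: Sequence[Hashable], relevant: Set[Hashable], k: int
) -> float:
    """AP@k for one query.

    Assumption: the score is normalised by min(k, |relevant|), and the
    ranked list contains no duplicate items.
    """
    if not relevant or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item sits.
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Mean of AP@k over all queries, as a fraction in [0, 1]."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    total = sum(
        average_precision_at_k(ranked, relevant, k)
        for ranked, relevant in zip(rankings, relevants)
    )
    return total / len(rankings)


if __name__ == "__main__":
    # Toy data, purely illustrative: two queries with ranked target IDs.
    rankings = [["a", "b", "c", "d"], ["x", "y", "z"]]
    relevants = [{"a", "c"}, {"z"}]
    print(mean_average_precision_at_k(rankings, relevants, k=4))
```

The function returns a fraction; scores such as 38.5 and 82.7 are on a 0 to 100 scale, so the result would be multiplied by 100 for comparison.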
