UNION: A Lightweight Target Representation for Efficient Zero-Shot Image-Guided Retrieval with Optional Textual Queries

Hoang-Bao Le, Allie Tran, Binh T. Nguyen, Liting Zhou, Cathal Gurrin

TL;DR

This work addresses zero-shot image-guided retrieval by unifying CIR and SBIR under IGROT and introduces UNION, a lightweight target representation that fuses the target image embedding with a null-text prompt via a small Transformer-MLP stack. The approach enables the target representation to inhabit the same vision-language embedding space as the query without modifying pretrained backbones, improving semantic alignment across modalities. To support data-efficient learning, the authors construct LlavaSCo (5,000 refined LaSCo triplets) and Training-Sketchy (5,000 SBIR triplets) and demonstrate strong results on CIRCO, FashionIQ, CIRR, Sketchy, TU-Berlin, and QuickDraw, with CIRCO mAP@50 reaching 38.5 and Sketchy mAP@200 reaching 82.7. The results show that UNION, particularly when paired with strong backbones and caption-rich data, yields robust cross-modal retrieval with limited supervision, highlighting practical benefits for scalable IGROT systems.

Abstract

Image-Guided Retrieval with Optional Text (IGROT) is a general retrieval setting in which a query consists of an anchor image, with or without accompanying text, and the goal is to retrieve semantically relevant target images. This formulation unifies two major tasks: Composed Image Retrieval (CIR) and Sketch-Based Image Retrieval (SBIR). In this work, we address IGROT under low-data supervision by introducing UNION, a lightweight and generalisable target representation that fuses the image embedding with a null-text prompt. Unlike traditional approaches that rely on fixed target features, UNION enhances semantic alignment with multimodal queries while requiring no architectural modifications to pretrained vision-language models. With only 5,000 training samples (from LlavaSCo for CIR and Training-Sketchy for SBIR), our method achieves competitive results across benchmarks, including a CIRCO mAP@50 of 38.5 and a Sketchy mAP@200 of 82.7, surpassing many heavily supervised baselines. This demonstrates the robustness and efficiency of UNION in bridging vision and language across diverse query types.
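The headline numbers above are mAP@k scores. As a rough illustration of what such a score measures, the sketch below computes average precision at k for one query and averages it over queries. It is not the benchmarks' official evaluation code: the function names are hypothetical, ranked IDs are assumed to be unique, and the normalisation by min(k, number of relevant items) is one common convention that individual benchmarks may not share.

```python
from typing import Hashable, Sequence, Set


def average_precision_at_k(
    ranked_ids: Sequence[Hashable], relevant_ids: Set[Hashable], k: int
) -> float:
    """AP@k for one query: mean of precision values at each relevant hit.

    Assumption: the sum is normalised by min(k, number of relevant items);
    other conventions exist.
    """
    if not relevant_ids or k <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant_ids))


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """mAP@k: AP@k averaged over all queries."""
    scores = [average_precision_at_k(r, g, k) for r, g in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: two queries with made-up image IDs.
    rankings = [["a", "b", "c", "d"], ["x", "y"]]
    relevants = [{"a", "c"}, {"y"}]
    print(mean_average_precision_at_k(rankings, relevants, k=4))
```

Reported scores such as 38.5 are this quantity expressed as a percentage.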

Paper Structure

This paper contains 26 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Overview of the model architecture for Composed Image Retrieval. Conventional models have no Null Text or UNION Feature; they rank the image pool directly by the similarity between the Fusion feature and each image's embedded feature.
  • Figure 2: UNION architecture combining all images to be retrieved and Null Text.
  • Figure 3: LLaVA captions for the target image and the reference image.
  • Figure 4: Qualitative results on the CIRCO validation set. The ground-truth images are underlined in red.
  • Figure 5: Heatmap showing average performance across different target feature types (original, sum, UNION) and backbones (CLIP-B, CLIP-L, BLIP), with and without enhanced captions from LlavaSCo, on the ZS-CIR task. UNION consistently improves performance in caption-rich scenarios, especially with stronger backbones like BLIP and CLIP-L, while the benefit diminishes slightly when trained without text.
  • ...and 2 more figures
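
The Figure 5 caption names three target feature types: original, sum, and UNION. The sketch below shows one way they could plug into similarity-based ranking. It rests on several assumptions: "original" is taken to be the plain image embedding, "sum" an element-wise sum of the image and null-text embeddings, and "union" a learned fusion supplied through a hypothetical `fuse` callable that stands in for the Transformer-MLP stack. All function names are illustrative and not taken from the paper.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def target_feature(
    image_emb: Vector,
    null_text_emb: Vector,
    mode: str = "original",
    fuse: Optional[Callable[[Vector, Vector], Vector]] = None,
) -> List[float]:
    """Build a target feature under one of the three assumed variants."""
    if mode == "original":  # assumed: image embedding only
        return l2_normalize(image_emb)
    if mode == "sum":  # assumed: element-wise sum with the null-text embedding
        return l2_normalize([i + t for i, t in zip(image_emb, null_text_emb)])
    if mode == "union":  # assumed: learned fusion passed in by the caller
        if fuse is None:
            raise ValueError("mode='union' needs a fuse callable")
        return l2_normalize(fuse(image_emb, null_text_emb))
    raise ValueError(f"unknown mode: {mode}")


def rank_targets(query_emb: Vector, target_feats: Sequence[Vector]) -> List[int]:
    """Return target indices sorted by cosine similarity to the query, best first."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(t))) for t in target_feats]
    return sorted(range(len(target_feats)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings; real backbones produce much higher-dimensional vectors.
    null_text = [0.1, 0.1]
    images = [[0.0, 1.0], [1.0, 0.1]]
    feats = [target_feature(img, null_text, mode="sum") for img in images]
    print(rank_targets([1.0, 0.0], feats))
```

In all three variants the ranking step is identical; only the construction of the target feature changes, which matches the abstract's claim that the pretrained backbone needs no architectural modification.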