LRVS-Fashion: Extending Visual Search with Referring Instructions

Simon Lepage, Jérémie Mary, David Picard

TL;DR

This work tackles the ambiguity of fashion image similarity by proposing Referred Visual Search (RVS) and introducing LRVS-Fashion, a large-scale dataset with 272k products and 842k images, plus a test gallery containing up to 2M distractors. The authors design a lightweight, weakly-supervised conditional embedding method built on Vision Transformers: an additional conditioning token is fed into the model, which is trained with the InfoNCE loss to align query and target embeddings given the conditioning $c_q$/$c_t$, without relying on explicit object detectors. LRVS-Fashion is built from LAION-5B, with synthetic metadata (captions, categories) supplying the referring information; this enables end-to-end training and strong robustness to distractors. Empirically, CondViT-based models achieve competitive or superior $R@1$ compared to strong detection-based baselines, and textual conditioning offers additional gains. The results demonstrate practical scalability to large catalogs and inform future research on conditional, referring-based visual search in fashion and beyond.
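The InfoNCE objective mentioned above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `temperature` value, and the one-directional (query to target) form of the loss are assumptions made for illustration. The query vectors stand in for conditionally embedded scenes and the target vectors for embedded isolated items; the other rows of the batch act as negatives.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(query_embs, target_embs, temperature=0.1):
    """Mean InfoNCE loss over a batch; row i of each list is a matching pair.

    The temperature value is an illustrative assumption, not a value
    taken from the paper.
    """
    q = [l2_normalize(v) for v in query_embs]
    t = [l2_normalize(v) for v in target_embs]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the row.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target (index i) as the positive.
        total += log_z - logits[i]
    return total / len(q)
```

The loss is small when each query is closest to its own target and grows when another item in the batch is more similar, which is what pushes matching pairs together and non-matching pairs apart.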

Abstract

This paper introduces a new challenge for image similarity search in the context of fashion, addressing the inherent ambiguity in this domain stemming from complex images. We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity, following recent interest in the industry. We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs, designed explicitly for this task. Unlike traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and demonstrates robustness, reaching a Recall at 1 (R@1) superior to strong detection-based baselines against 2M distractors. The dataset is available at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion.
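To make the evaluation metric concrete, here is a small sketch of how Recall at 1 against a gallery with distractors could be computed. The function names and the brute-force cosine search are illustrative assumptions; they are not taken from the paper, and a gallery with millions of items would need an approximate nearest-neighbour index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def recall_at_1(queries, gallery, target_idx):
    """Fraction of queries whose most similar gallery item is their target.

    The gallery holds true targets and distractors alike; target_idx[i] is
    the gallery position of the correct product for query i.
    """
    hits = 0
    for q, t in zip(queries, target_idx):
        # Brute-force nearest neighbour by cosine similarity.
        best = max(range(len(gallery)), key=lambda j: cosine(q, gallery[j]))
        hits += best == t
    return hits / len(queries)
```

Adding distractors can only keep the score the same or lower it, since each extra gallery item is another chance to outrank the true target.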

Paper Structure (50 sections, 3 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Overview of the Referred Visual Search task. Given a query image and conditioning information, the goal is to retrieve a target instance from a large gallery. Note that a query is made of an image and an additional text or category, specifying which aspect of the image is relevant.
  • Figure 2: Overview of the data collection. a) Selection of a subset of domains belonging to known fashion retailers. b) Extraction of product identifiers in the URLs using domain-specific regular expressions. c) Generation of synthetic metadata for the products (categories, captions, ...) using both pretrained and finetuned models. d) Deduplication of the images, and assignment to subsets.
  • Figure 3: Samples from LRVS-F. Each product is represented on at least a simple and a complex image, and is associated with a category. The simple images are also described by captions from LAION and BLIP2. Please refer to the appendix for more samples.
  • Figure 4: Overview of our method on LRVS-F. For each element in a batch, we embed the scene conditionally and the isolated item unconditionally. We optimize an InfoNCE loss over the cosine similarity matrix. $\oplus$ denotes concatenation to the patch sequence.
  • Figure 4: Examples of sub-categories.
  • ...and 10 more figures