Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset

Yannick Hauri, Luca A. Lanzendörfer, Till Aczel

TL;DR

This paper tackles generating editorial-style fashion imagery conditioned on garment images, addressing the gap between shop-focused images and creative lookbook visuals. It introduces a zero-shot, ensemble retrieval pipeline that combines SigLIP2, vision-language reasoning, and open-vocabulary object detection to pair garment images with lookbook content, enabling automatic dataset construction. The authors assemble the first large-scale garment–lookbook dataset (~550k lookbook/runway images and ~9.5M garments) and create three quality splits (10k/50k/300k) to support different training regimes. Empirical results show strong retrieval performance and clear benefits from model ensembling, achieving competitive or state-of-the-art recall on DressCode and DeepFashion2 benchmarks. The dataset and methodology pave the way for diffusion-based generation of context-rich virtual photo-shoots that blend fashion imagery with narrative environments.

Abstract

Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the task of virtual fashion photo-shoot, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining visual-language reasoning with object-level localization. We construct a dataset with three garment-lookbook pair accuracy levels: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.

Paper Structure

This paper contains 10 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Differences between garment, shop, and lookbook images. Existing datasets provide clean shop images, which are not suitable for training virtual photo-shoot models.
  • Figure 2: Left: Examples of lookbook images (gallery) and garment images (query), showing a sample where one garment appears in both a gallery and a query image. Matching gallery-query pairs form the basis of our dataset, and the task is to find all such matches. Right: Overview of our retrieval pipeline. Query images are embedded with SigLIP2, while garment descriptions for gallery images are generated with a vision-language model (VLM). Object detection (OD) conditioned on the garment description produces bounding boxes for individual garments. Embeddings of gallery images, descriptions, and bounding boxes are compared with SigLIP2 to compute image-to-image, image-to-bbox, and image-to-text similarities.
  • Figure 3: Rank correlation heatmap between retrieval models. Values are rounded to two decimals and centered in each cell.
  • Figure 4: Garment retrieval accuracy of our dataset at indices 100, 2000, 8000, 32000, 128000, 512000, and 2048000, estimated by qualitative evaluation of 200 garment–lookbook image pair samples at each index. The dataset is sorted by the similarity score between the garment and lookbook image of each pair.
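
The retrieval idea in the Figure 2 caption, together with the similarity-sorted ordering in Figure 4, can be illustrated with a minimal sketch. It assumes the embeddings are already computed (the short vectors below are toy stand-ins, not real SigLIP2 outputs) and that the three similarities are fused by a plain average. The fusion rule, the zero fallback when no bounding box is found, and the names `cosine`, `ensemble_score`, and `rank_pairs` are assumptions made for illustration, not the authors' stated method.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_score(
    query: Vector,
    gallery_image: Vector,
    gallery_bboxes: Sequence[Vector],
    gallery_text: Vector,
) -> float:
    """Average of image-to-image, image-to-bbox, and image-to-text similarity.

    Assumption: the best-matching bounding box represents the bbox signal,
    and a gallery image with no detected boxes contributes 0.0 for it.
    """
    image_sim = cosine(query, gallery_image)
    bbox_sim = max((cosine(query, box) for box in gallery_bboxes), default=0.0)
    text_sim = cosine(query, gallery_text)
    return (image_sim + bbox_sim + text_sim) / 3.0


def rank_pairs(scores: Dict[str, float]) -> List[str]:
    """Return pair identifiers sorted from highest to lowest ensemble score."""
    return sorted(scores, key=lambda pair_id: scores[pair_id], reverse=True)


if __name__ == "__main__":
    query = [1.0, 0.0]  # toy garment (query) embedding
    scores = {
        # Gallery image whose full image, one bbox, and description match the query.
        "pair_a": ensemble_score(query, [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]),
        # Gallery image with no detected boxes and unrelated content.
        "pair_b": ensemble_score(query, [0.0, 1.0], [], [0.0, 1.0]),
    }
    print(rank_pairs(scores))
```

Sorting all candidate pairs by such a score gives the kind of ordering that Figure 4 samples at fixed indices. How the scores are actually fused and where the quality splits are cut is determined by the paper, not by this sketch.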