Virtual Fashion Photo-Shoots: Building a Large-Scale Garment-Lookbook Dataset
Yannick Hauri, Luca A. Lanzendörfer, Till Aczel
TL;DR
This paper tackles generating editorial-style fashion imagery conditioned on garment images, addressing the gap between shop-focused images and creative lookbook visuals. It introduces a zero-shot, ensemble retrieval pipeline that combines SigLIP2, vision-language reasoning, and open-vocabulary object detection to pair garment images with lookbook content, enabling automatic dataset construction. The authors assemble the first large-scale garment–lookbook dataset (~550k lookbook/runway images and ~9.5M garments) and create three quality splits (10k/50k/300k) to support different training regimes. Empirical results show strong retrieval performance and clear benefits from model ensembling, achieving competitive or state-of-the-art recall on DressCode and DeepFashion2 benchmarks. The dataset and methodology pave the way for diffusion-based generation of context-rich virtual photo-shoots that blend fashion imagery with narrative environments.
Abstract
Fashion image generation has so far focused on narrow tasks such as virtual try-on, where garments appear in clean studio environments. In contrast, editorial fashion presents garments through dynamic poses, diverse locations, and carefully crafted visual narratives. We introduce the virtual fashion photo-shoot task, which seeks to capture this richness by transforming standardized garment images into contextually grounded editorial imagery. To enable this new direction, we construct the first large-scale dataset of garment-lookbook pairs, bridging the gap between e-commerce and fashion media. Because such pairs are not readily available, we design an automated retrieval pipeline that aligns garments across domains, combining vision-language reasoning with object-level localization. The resulting dataset is organized into three tiers of garment-lookbook pair accuracy: high quality (10,000 pairs), medium quality (50,000 pairs), and low quality (300,000 pairs). This dataset offers a foundation for models that move beyond catalog-style generation and toward fashion imagery that reflects creativity, atmosphere, and storytelling.
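To illustrate how scores from several retrieval models could be fused and then cut into quality tiers, the following is a minimal Python sketch. It is not the authors' pipeline: the min-max normalization, the unweighted averaging, the disjoint rank-based tiers, and every function and variable name are assumptions made for this example. The toy scores stand in for the outputs of an embedding model, a vision-language model, and an object detector.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (garment_id, lookbook_image_id), hypothetical identifiers


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so different models become comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All candidates scored identically; this model carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_scores(per_model_scores: Dict[str, Sequence[float]]) -> List[float]:
    """Average the normalized scores of all models (assumed equal weights)."""
    normalized = [min_max_normalize(s) for s in per_model_scores.values()]
    n_pairs = len(normalized[0])
    return [
        sum(model[i] for model in normalized) / len(normalized)
        for i in range(n_pairs)
    ]


def assign_tiers(
    pairs: Sequence[Pair],
    fused: Sequence[float],
    tier_sizes: Sequence[Tuple[str, int]],
) -> Dict[str, List[Pair]]:
    """Rank pairs by fused score and fill tiers from the top down.

    Tiers are disjoint here; whether the real splits are nested or
    disjoint is not stated in the text, so this is an assumption.
    """
    order = sorted(range(len(pairs)), key=lambda i: fused[i], reverse=True)
    tiers: Dict[str, List[Pair]] = {}
    start = 0
    for name, size in tier_sizes:
        tiers[name] = [pairs[i] for i in order[start:start + size]]
        start += size
    return tiers


# Toy candidate pairs and made-up scores, for illustration only.
pairs = [("g1", "l1"), ("g2", "l2"), ("g3", "l3"), ("g4", "l4")]
scores = {
    "embedding": [0.9, 0.2, 0.6, 0.4],
    "vlm": [1.0, 0.0, 0.5, 0.5],
    "detector": [0.8, 0.1, 0.7, 0.3],
}

fused = ensemble_scores(scores)
tiers = assign_tiers(pairs, fused, [("high", 1), ("medium", 1), ("low", 2)])
print(tiers)
```

At the scale described in the abstract, the tier sizes would be 10,000, 50,000, and 300,000 instead of the toy values used here. A score threshold per tier would be an equally plausible alternative to fixed-size cuts.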
