Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation

Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani

TL;DR

This work presents LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. It also introduces Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image.

Abstract

Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, improving over state-of-the-art methods. The dataset, platform, and code are publicly available.

Paper Structure (28 sections, 7 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: LOTS enables automation of the fashion design process at a new level of detail. The figure illustrates a design scenario where sketches are complemented by natural language descriptions to characterize garment material, style, and structure. LOTS represents a paradigm shift in design methodologies, advancing beyond global text with global sketch (IP-Adapter ye2023ip) and global text with localized sketches (Multi-ControlNet zhang2023adding). Our approach adds localized sketch-text specifications (the coloured boxes), enabling fine-grained control over the layout and attributes of multiple garment items. All textual descriptions are shown in a contracted form for readability; see text.
  • Figure 2: LOTS pipeline. 1. The first Multi-level Conditioning stage constructs a conditioning representation spanning both local and global levels. Locally, the Modularized Pair-Centric Representation module (Sec. \ref{sec:method:pair-centric-representation}) handles each sketch-text pair independently: modality-specific, frozen encoders first map sketches and texts into their respective embeddings, which are then fused in the Pair-Former by integrating textual semantics with the spatial structure of the corresponding sketch. In parallel, the Global Conditioning (Sec. \ref{sec:method:global-conditioning}) derives a global representation from the full sketch and injects it via cross-attention to promote consistency and interaction across multiple pairs. 2. In the subsequent Diffusion Pair Guidance stage (Sec. \ref{sec:method:diffusion-guidance}), the multi-level embeddings are progressively incorporated into the diffusion process, together with the Global Context Description, which drives the background generation and shapes the overall style. Rather than explicitly merging all pair representations upfront, conditioning is applied throughout the denoising process, enabling gradual integration and preventing the attribute leakage typically induced by early representation fusion.
  • Figure 3: Overview of Sketchy. We build a hierarchical structure by pairing the garment part annotations with their related whole-body garment. Then, garment-level sketches and natural language descriptions are added based on off-the-shelf models and the in-the-wild sketch collection pipeline. The bar charts illustrate the frequency of the prevalent categories within the dataset, representing the total count of annotations where whole-body items (left) and garment parts (right) appear.
  • Figure 4: Examples of sketches in Sketchy and its Sketchy in the Wild subset. The left column presents automatically annotated sketches in Sketchy. The middle column shows collected human-drawn sketches in the Sketchy in the Wild subset. The right column displays the corresponding original fashion images. Human-drawn sketches exhibit higher subjective abstraction and stylistic variability.
  • Figure 5: Qualitative comparison of LOTS with our prior work LOTS* girella2025lots, ControlNet zhang2023adding, IP-Adapter ye2023ip, and Multi-T2I-adapter mou2024t2i, all in their fine-tuned versions. Given localized sketch-text pairs as conditioning inputs, LOTS better captures fine-grained attributes within the intended local regions of the generated images, effectively mitigating attribute confusion while maintaining strong global structural alignment.
  • ...and 2 more figures
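
The conditioning flow described in the Figure 2 caption (each sketch-text pair fused independently, then made globally aware through cross-attention on the full sketch) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names `cross_attention` and `encode_pairs`, the token shapes, the residual additions, the choice of sketch tokens as queries, and the absence of learned projections are all assumptions made for illustration, and random arrays stand in for the outputs of the frozen encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, to keep the
    example minimal. Returns the attended values and the weights.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

def encode_pairs(pairs, global_sketch):
    """Hypothetical stand-in for the multi-level conditioning stage.

    pairs: list of (sketch_tokens, text_tokens), one entry per garment.
    global_sketch: tokens representing the full sketch.
    """
    embeddings = []
    for sketch_tokens, text_tokens in pairs:
        # Local level: sketch tokens attend only to the text of their
        # own pair, so attributes from other pairs cannot leak in here.
        fused, _ = cross_attention(sketch_tokens, text_tokens)
        local = sketch_tokens + fused
        # Global level: each pair embedding attends to the full-sketch
        # tokens, which are shared by all pairs.
        context, _ = cross_attention(local, global_sketch)
        embeddings.append(local + context)
    return embeddings

rng = np.random.default_rng(0)
dim = 16
# Random placeholders for encoder outputs (illustrative shapes only).
pairs = [
    (rng.normal(size=(4, dim)), rng.normal(size=(6, dim))),
    (rng.normal(size=(5, dim)), rng.normal(size=(3, dim))),
]
global_sketch = rng.normal(size=(8, dim))
pair_embeddings = encode_pairs(pairs, global_sketch)
```

In this sketch the per-pair embeddings are kept as a list rather than merged into one vector, mirroring the caption's point that pair representations are not fused upfront but handed to the denoising process separately. Changing the text of one pair leaves the embedding of every other pair unchanged, which is the property the caption associates with avoiding attribute leakage.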