Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models

Qice Qin, Yuki Hirakawa, Ryotaro Shimizu, Takuya Furusawa, Edgar Simo-Serra

TL;DR

This work enhances the fashionability of outfit images without external prompts by combining a diffusion-based generator with segmentation-conditioned controls and a classifier-guided fashionability objective. It introduces two expert-annotated datasets (OpenSkill-based and 5-Dimension) used to train and evaluate a mid-U classifier guidance scheme that steers latent diffusion toward more fashionable outputs while preserving body shape and identity. The reported results show gains over the Fashion++ baseline in both predicted fashionability and qualitative image quality, supported by a user study and failure analyses. The approach offers a practical framework for automatic fashionability enhancement, with potential applications in virtual try-on, fashion design, and e-commerce workflows.

Abstract

Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data for guidance while generating and evaluating the images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts through OpenSkill-based and five critical aspect-based pairwise comparisons. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.
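
To make the pairwise scoring idea in the abstract concrete, the following is a minimal sketch of a Gaussian rating update driven by a single "which outfit is more fashionable" comparison. It follows a Bradley-Terry-style Bayesian approximation of the kind that OpenSkill-like rating systems build on, but it is not the authors' pipeline and does not use the OpenSkill library API. The names `Rating`, `update_pair`, `BETA`, and `KAPPA`, as well as the prior values, are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Rating:
    mu: float = 25.0           # assumed prior mean fashionability score
    sigma: float = 25.0 / 3.0  # assumed prior uncertainty (standard deviation)


BETA = 25.0 / 6.0  # assumed noise scale of a single comparison
KAPPA = 1e-4       # floor that keeps the variance strictly positive


def update_pair(winner: Rating, loser: Rating) -> tuple[Rating, Rating]:
    """Update two image ratings after an annotator prefers `winner`."""
    c = math.sqrt(winner.sigma**2 + loser.sigma**2 + 2 * BETA**2)
    # Logistic probability that each image wins, given the current means.
    p_win = 1.0 / (1.0 + math.exp((loser.mu - winner.mu) / c))
    p_lose = 1.0 - p_win

    def step(r: Rating, outcome: float, p_self: float, p_other: float) -> Rating:
        # Shift the mean toward the observed outcome (1 = preferred, 0 = not).
        delta = (r.sigma**2 / c) * (outcome - p_self)
        # Shrink the uncertainty: every comparison adds information.
        eta = (r.sigma / c) * (r.sigma**2 / c**2) * p_self * p_other
        new_sigma = r.sigma * math.sqrt(max(1.0 - eta, KAPPA))
        return Rating(r.mu + delta, new_sigma)

    return step(winner, 1.0, p_win, p_lose), step(loser, 0.0, p_lose, p_win)


# One expert judges image A to be more fashionable than image B.
a, b = update_pair(Rating(), Rating())
```

Repeating this update over many expert comparisons moves each image's mean score up or down while narrowing its distribution, which is the behavior the scoring process relies on.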

Paper Structure

This paper contains 18 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Illustrations of minimal edits to enhance fashionability. 1) Modifying white pants to brown and adding floral patterns to the blouse (left). 2) Adding a belt to the outfit as an accessory (center). 3) Adjusting the shape and size of a loose-fitting red jumpsuit to a more form-fitting and tailored version (right).
  • Figure 2: Overview of our proposed approach for conditional fashion image generation. Our approach consists of four main components: 1) a diffusion-based generation module that applies a diffusion process to iteratively refine the input image into a fashion-enhanced output while maintaining visual coherence; 2) a human parsing representation extractor that generates segmentation maps from the input image; 3) a mid-U classifier that processes mid-level UNet outputs to compute the fashion loss, which is fed back to the latent representation for fashionability optimization; and 4) an identity preservation process that combines the generated output with the segmentation map to restore the original subject's head onto the new image.
  • Figure 3: Illustration of the OpenSkill-based fashionability scoring process. The first part shows images ranked along an axis, representing their average fashionability scores with associated normal distributions. The second part depicts the annotation process, where human evaluators provide pairwise comparisons to determine which image is more fashionable. The third part reflects the updated scores and distributions of the images, adjusted based on the annotations.
  • Figure 4: Construction of the dataset, showing how the data were divided for training and testing each model.
  • Figure 5: Qualitative comparison of our proposed approach and Fashion++ for the same input.
  • ...and 1 more figure
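
The identity preservation step described in the Figure 2 caption, which restores the original subject's head onto the generated image using the segmentation map, can be illustrated with a simple mask-based composite. This is only a sketch under stated assumptions: the function name `restore_head`, the label ids in `HEAD_LABELS`, and the toy arrays are hypothetical, and the paper's actual blending procedure may differ.

```python
import numpy as np

# Hypothetical label ids for head-related regions (e.g. face and hair);
# the real ids depend on the human parsing model that is used.
HEAD_LABELS = (1, 2)


def restore_head(original, generated, seg_map, head_labels=HEAD_LABELS):
    """Copy head pixels from `original` into `generated`.

    original, generated: H x W x 3 image arrays of the same shape.
    seg_map: H x W integer array of per-pixel parsing labels.
    """
    # Boolean mask of head pixels, with a trailing axis so that it
    # broadcasts over the colour channels.
    mask = np.isin(seg_map, head_labels)[..., None]
    return np.where(mask, original, generated)


# Toy example: a 4x4 image in which two pixels belong to the head.
original = np.full((4, 4, 3), 200, dtype=np.uint8)
generated = np.full((4, 4, 3), 50, dtype=np.uint8)
seg_map = np.zeros((4, 4), dtype=np.int64)
seg_map[0, 1:3] = 1

result = restore_head(original, generated, seg_map)
```

A hard mask like this can leave visible seams at the boundary; feathering the mask edge is a common refinement, though whether the authors do so is not stated here.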