
Fashion Style Editing with Generative Human Prior

Chaerin Kong, Seungyong Lee, Soohyeok Im, Wonsuk Yang

TL;DR

This work tackles fashion style editing on full-body human images using text prompts. It introduces FaSE, a framework built on a StyleGAN-Human prior that uses three latent-mapper branches ($M^c$, $M^m$, $M^f$) in the $W^+$ space, guided by CLIP loss and regularization. To overcome the limitations of naive CLIP guidance, it employs textual augmentation via an LLM and visual reference retrieval from a curated fashion image database, incorporating $\text{L}_{CLIP}$ and $\text{L}_{Ref}$ losses, with references inverted into $W^+$ to provide vivid, illustrative signals. The approach also analyzes the hierarchical latent space to assign fashion edits to the appropriate levels (pose, garment shape, texture), and experiments show improvements over StyleCLIP in both qualitative and quantitative assessments, supported by human and AI-driven evaluations. Overall, FaSE enables robust, flexible, and interpretable fashion style edits with practical implications for fashion imaging and digital garment design.
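The three-branch mapper idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy latent size (18 layers of 8 values), the layer split into coarse, medium, and fine groups (borrowed from the usual StyleCLIP-style convention), and the single random linear layer with `tanh` that stands in for each learned mapper. The helper names `make_mapper` and `edit_latent` are hypothetical.

```python
import math
import random

# Toy sizes (assumption): a real W+ code is much wider per layer.
NUM_LAYERS, DIM = 18, 8

# Assumed split of W+ layers into coarse / medium / fine groups.
GROUPS = {"c": range(0, 4), "m": range(4, 8), "f": range(8, NUM_LAYERS)}


def make_mapper(rng, dim=DIM, scale=0.1):
    """Return a toy stand-in for a learned mapper: one random linear layer + tanh."""
    weight = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def mapper(w_layer):
        # Offset for one layer of the latent code.
        return [
            math.tanh(sum(x * weight[i][j] for i, x in enumerate(w_layer)))
            for j in range(dim)
        ]

    return mapper


def edit_latent(w_plus, mappers, active=("c", "m", "f"), strength=1.0):
    """Add mapper offsets only to the layer groups listed in `active`."""
    edited = [list(layer) for layer in w_plus]
    for name in active:
        for idx in GROUPS[name]:
            offset = mappers[name](w_plus[idx])
            edited[idx] = [x + strength * d for x, d in zip(w_plus[idx], offset)]
    return edited


rng = random.Random(0)
mappers = {name: make_mapper(rng) for name in GROUPS}
w = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_LAYERS)]

# Apply only the medium branch, leaving coarse and fine layers untouched.
w_edit = edit_latent(w, mappers, active=("m",))
```

Restricting `active` to one group mirrors the idea of assigning an edit to a single level of the hierarchy: only the selected layers of the latent code change, while the rest stay identical to the source.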

Abstract

Image editing has been a long-standing challenge in the research community with its far-reaching impact on numerous applications. Recently, text-driven methods started to deliver promising results in domains like human faces, but their applications to more complex domains have been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that the existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field.

Paper Structure (11 sections, 2 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Example of text-driven fashion style editing using our framework (FaSE). Given an input image, a text prompt drives a change in garment style while preserving the overall attributes of the input.
  • Figure 2: Using CLIP text guidance as in StyleCLIP (Patashnik et al., 2021) is insufficient to steer fashion concepts in full-body images, and image guidance in CLIP feature space often distorts the semantics (left). FaSE, in contrast, successfully projects non-trivial fashion styles ('street-fashion' in this example) onto human images with both types of guidance (right).
  • Figure 3: Summary of our fashion style editing framework (FaSE). We learn a latent mapper (Patashnik et al., 2021) with two types of illustrative guidance. (A) We transform fashion style concepts into visual descriptions with a pretrained language model. (B) From our preconstructed fashion image database, we retrieve the top-k reference images that both suit the target text prompt and resemble the source image; these are used as additional visual guidance.
  • Figure 4: From the top, we train $M^c$, $M^m$, and $M^f$, respectively, for the prompt 'suit'. When editing the garment shape as in 'suit', $M^m$ needs to be modified to transform the mid part of $\mathbf{w}$.
  • Figure 5: Qualitative comparison with StyleCLIP using text and image guidance signals.
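
The reference retrieval step described in the Figure 3 caption can be sketched as follows. The captions do not say how the two criteria (matching the text prompt, resembling the source image) are combined, so the weighted sum of cosine similarities below is an assumption, as are the weight `alpha`, the function name `retrieve_references`, and the three-dimensional toy vectors that stand in for real image and text embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_references(text_emb, source_emb, database, k=3, alpha=0.5):
    """Return the names of the k database entries with the best combined score.

    Score (assumed form): alpha * sim(ref, text) + (1 - alpha) * sim(ref, source).
    """
    scored = [
        (alpha * cosine(emb, text_emb) + (1 - alpha) * cosine(emb, source_emb), name)
        for name, emb in database.items()
    ]
    scored.sort(reverse=True)  # highest combined score first
    return [name for _, name in scored[:k]]


# Toy database: names and embeddings are placeholders, not real data.
database = {
    "ref_a": [1.0, 0.2, 0.0],  # close to the text prompt
    "ref_b": [0.7, 0.7, 0.0],  # balanced between text and source
    "ref_c": [0.0, 1.0, 0.0],  # close to the source image
    "ref_d": [0.0, 0.0, 1.0],  # unrelated to both
}
text_emb = [1.0, 0.0, 0.0]
source_emb = [0.0, 1.0, 0.0]

top = retrieve_references(text_emb, source_emb, database, k=2)
```

With equal weighting, the entry that is moderately similar to both the prompt and the source ranks first, which is the behavior the caption describes; setting `alpha=1.0` reduces the score to text similarity alone.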