StyleHumanCLIP: Text-guided Garment Manipulation for StyleGAN-Human

Takato Yoshikawa, Yuki Endo, Yoshihiro Kanamori

TL;DR

This work introduces StyleHumanCLIP, a text-guided garment editing framework for StyleGAN-Human that preserves identity while editing full-body garments. The core idea is an attention-based latent code mapper that uses cross-attention between latent codes and CLIP text embeddings to generate a latent residual $\Delta w$, which is added to the input latent code $w$ in the $W^+$ space to obtain $w'$. To constrain edits to garment regions, the method employs feature-space masking: it computes masks with a human parsing model and blends feature maps using the mask $M = P_t(G(w)) \cup P_t(G(w'))$. The approach is trained with CLIP-based losses, including a directional CLIP component, and is evaluated against StyleCLIP, HairCLIP+, and diffusion-based methods, showing improved text fidelity and identity preservation; it also applies to real images via GAN inversion. Overall, StyleHumanCLIP advances text-based editing of full-body human images by combining an attention-based latent mapper with inference-time masking, offering practical garment manipulation that preserves subject identity.
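
The masking step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the helper names `union_mask` and `blend_features` are hypothetical, masks are plain nested lists of 0/1 values, and the blending rule (edited features inside $M$, original features outside) is an assumption about how the mask is applied to the feature maps.

```python
def union_mask(mask_a, mask_b):
    """Element-wise union (logical OR) of two binary masks of equal shape."""
    return [[1 if (a or b) else 0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


def blend_features(feat_orig, feat_edit, mask):
    """Take edited features where mask == 1 and original features elsewhere.

    Assumed blending rule: mask * edited + (1 - mask) * original.
    """
    return [[m * e + (1 - m) * o for o, e, m in zip(row_o, row_e, row_m)]
            for row_o, row_e, row_m in zip(feat_orig, feat_edit, mask)]


# Toy 2x3 example: target-garment region parsed from G(w) and from G(w').
mask_w = [[0, 1, 0],
          [0, 1, 0]]
mask_w_edit = [[0, 1, 1],
               [0, 0, 0]]
M = union_mask(mask_w, mask_w_edit)

# Hypothetical single-channel feature maps for the original and edited codes.
feat_orig = [[1.0, 1.0, 1.0],
             [1.0, 1.0, 1.0]]
feat_edit = [[5.0, 5.0, 5.0],
             [5.0, 5.0, 5.0]]
blended = blend_features(feat_orig, feat_edit, M)
```

Taking the union rather than only one of the two masks covers the garment both where it was before the edit and where it ends up afterwards, so a change in garment shape (for example, longer sleeves) is not clipped by the original silhouette.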

Abstract

This paper tackles text-guided control of StyleGAN for editing garments in full-body human images. Existing StyleGAN-based methods struggle to handle the rich diversity of garments, body shapes, and poses. We propose a framework for text-guided full-body human image synthesis via an attention-based latent code mapper, which enables more disentangled control of StyleGAN than existing mappers. Our latent code mapper adopts an attention mechanism that adaptively manipulates individual latent codes on different StyleGAN layers under text guidance. In addition, we introduce feature-space masking at inference time to avoid unwanted changes caused by text inputs. Our quantitative and qualitative evaluations show that our method controls generated images more faithfully to the given texts than existing methods.
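
To make the idea of per-layer, text-conditioned manipulation concrete, the following toy sketch computes a separate residual for each layer-wise latent code with scaled dot-product cross-attention. It is an illustration under stated assumptions, not the paper's architecture: the function names are hypothetical, latent codes are assumed to act as queries and text token embeddings as keys and values, learned projections are omitted, and the dimensions are tiny for readability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attention_residual(latents, text_tokens):
    """Return one residual vector per layer-wise latent code.

    Each latent code is a query; text token embeddings are keys and values.
    Assumption: identity projections instead of learned ones.
    """
    dim = len(latents[0])
    residuals = []
    for w in latents:
        weights = softmax([dot(w, t) / math.sqrt(dim) for t in text_tokens])
        residuals.append([sum(a * t[i] for a, t in zip(weights, text_tokens))
                          for i in range(dim)])
    return residuals


def apply_residual(latents, residuals):
    """Compute w' = w + delta_w for every layer."""
    return [[a + b for a, b in zip(w, d)] for w, d in zip(latents, residuals)]


# Toy example: 3 layer-wise latent codes and 2 text tokens, all of dimension 2.
latents = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
delta_w = cross_attention_residual(latents, text_tokens)
w_edit = apply_residual(latents, delta_w)
```

Because the attention weights depend on each latent code, different layers receive different residuals from the same text input, which is the sense in which the manipulation is adaptive per layer.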

Paper Structure (27 sections, 8 equations, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Garment editing comparison of existing methods and ours. StyleCLIP erroneously changes the facial identity and pants. HairCLIP+ (a HairCLIP variant trained with the same loss functions as ours) neglects the textual input due to its poor editing capability. In contrast, our method achieves virtual try-on of "a long-sleeve T-shirt" while preserving the facial identity and pants.
  • Figure 2: Overview of the proposed framework. The mapper network translates the latent codes $w$ into the latent codes $w'$ that reflect the text input. During training, only the mapper network is trained; the other networks are frozen.
  • Figure 3: Architecture of our latent code mapper.
  • Figure 4: Overview of feature-space masking.
  • Figure 5: Qualitative comparison of pixel-space masking and feature-space masking.
  • ...and 9 more figures