Sketch2Human: Deep Human Generation with Disentangled Geometry and Appearance Control

Linzi Qu, Jiaxiang Shang, Hui Ye, Xiaoguang Han, Hongbo Fu

TL;DR

Sketch2Human tackles controllable full-body human image generation by conditioning geometry on semantic sketches and appearance on reference images. It introduces a two-stage approach: Sketch Image Inversion, which maps sketches into the StyleGAN-Human latent space, and Body Generator Tuning, which uses synthetic, style-mixed data to disentangle geometry and appearance and fine-tune the generator. The method enables explicit control over body contours and garment textures, handles hand-drawn sketches, and supports appearance transfer from real or synthetic references, demonstrated via extensive qualitative and quantitative evaluations and user studies. The work advances practical editing and design tasks (fashion design, avatar creation, virtual try-on) by providing flexible, disentangled geometry-appearance control for full-body generation.

Abstract

Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance and it is hard to balance the realism and the faithfulness of their results to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator to achieve disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method to state-of-the-art methods.

Paper Structure

This paper contains 23 sections, 2 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Our Sketch2Human generates high-quality full-body images with respect to an input semantic sketch for geometry control and a reference image for appearance control. (a), (b), (c), and (d) correspond to four different sketch inputs.
  • Figure 2: Examples of inputs and style-mixing results. (a) a sketch input. (b) the corresponding parsing map. (c) the semantic sketch produced from (a) and (b). (d) an appearance input. (e)-(g) results from mixing at layers 6, 8, and 10, respectively.
  • Figure 3: An illustration of the training (right) and inference (left) pipelines of our method for full-body human image generation conditioned on a semantic sketch $PS_g$ and a reference image $I_a$. The training pipeline consists of two main modules: Sketch Image Inversion (top right) and Body Generator Tuning (bottom right). In the Sketch Image Inversion module, we first sample a latent code $w_g$ to generate the training triplet (semantic sketch $PS_g$, parsing map $P_g$, sketch $S_g$). Then, we use these data to train a sketch encoder. In the Body Generator Tuning module, given an appearance code $w_a$, we also sample a latent code $w_g$ to prepare the appearance-transferred $I_{mix6}$ and geometry-preserved $I_{mix10}$ training samples via style mixing at different layers. Then, we use them to fine-tune the generator $G(w; \theta^{'})$. During inference, the sketch encoder first embeds $PS_g$ into a latent code, which is mixed with the appearance code derived from $I_a$ to form $w_{mix8}$. Given $w_{mix8}$, $G(w; \theta^{'})$ produces the final result $I_{syn}$.
  • Figure 4: Three examples of prepared data. Each example in (a)-(c) shows an input semantic sketch and its corresponding appearance-transferred and geometry-preserved results.
  • Figure 5: Qualitative comparisons between our method and four related sketch-based methods. Our method shows the best geometry and appearance transfer results. The sketch images are extracted from the DeepFashion dataset. The appearance images, sampled from StyleGAN-Human, include pure color and texture images. Pure color images show garments with one or more colors but no fabric patterns (a)-(b), while texture images contain both colors and fabric patterns (c)-(f).
  • ...and 8 more figures
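
The style mixing referred to in the captions of Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' implementation; it only shows the usual layer-wise mixing of two extended latent codes. The layer count (18), the latent width (512), the function name `style_mix`, and the convention that layers below the mixing index come from the geometry code while the remaining layers come from the appearance code are all assumptions made for illustration.

```python
import numpy as np

# Assumed dimensions for illustration only; the source does not state them.
NUM_LAYERS = 18
LATENT_DIM = 512


def style_mix(w_geometry, w_appearance, mix_layer):
    """Hypothetical helper: combine two per-layer latent codes.

    Layers [0, mix_layer) are copied from the geometry code and layers
    [mix_layer, NUM_LAYERS) from the appearance code (assumed convention).
    """
    if w_geometry.shape != w_appearance.shape:
        raise ValueError("latent codes must have the same shape")
    if not 0 <= mix_layer <= w_geometry.shape[0]:
        raise ValueError("mix_layer is out of range")
    w_mix = w_appearance.copy()
    w_mix[:mix_layer] = w_geometry[:mix_layer]
    return w_mix


# Random stand-ins for the geometry code w_g and the appearance code w_a.
rng = np.random.default_rng(0)
w_g = rng.standard_normal((NUM_LAYERS, LATENT_DIM))
w_a = rng.standard_normal((NUM_LAYERS, LATENT_DIM))

w_mix6 = style_mix(w_g, w_a, 6)    # more layers from w_a
w_mix8 = style_mix(w_g, w_a, 8)    # mixing index named for inference
w_mix10 = style_mix(w_g, w_a, 10)  # more layers from w_g
```

Under this assumed convention, a lower mixing index hands more layers to the appearance code, which matches the captions' description of $I_{mix6}$ as appearance-transferred and $I_{mix10}$ as geometry-preserved. A generator would then map each mixed code to an image; that step is omitted here because it depends on the pretrained model.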