FashionEngine: Interactive 3D Human Generation and Editing via Multimodal Controls

Tao Hu, Fangzhou Hong, Zhaoxi Chen, Ziwei Liu

TL;DR

FashionEngine addresses interactive 3D human generation and editing with multimodal controls by learning a 3D human prior in a semantic UV latent space and unifying user inputs in a Multimodality-UV Space. It introduces UV-aligned samplers (Text-UV and Sketch-UV) on top of a two-stage prior learned with StructLDM, enabling conditional and unconditional generation, editing, and 3D virtual try-on with view-consistent results. The approach reports state-of-the-art performance on text- and sketch-driven editing tasks and runs interactively at about 9.2 FPS at 512^2 resolution on a V100 GPU. The authors note limitations, including a dataset bias toward female dresses and remaining difficulty with skirt variations, and point to data diversification and broader garment coverage as future work.

Abstract

We present FashionEngine, an interactive 3D human generation and editing system that creates 3D digital humans via user-friendly multimodal controls such as natural languages, visual perceptions, and hand-drawing sketches. FashionEngine automates the 3D human production with three key components: 1) A pre-trained 3D human diffusion model that learns to model 3D humans in a semantic UV latent space from 2D image training data, which provides strong priors for diverse generation and editing tasks. 2) Multimodality-UV Space encoding the texture appearance, shape topology, and textual semantics of human clothing in a canonical UV-aligned space, which faithfully aligns the user multimodal inputs with the implicit UV latent space for controllable 3D human editing. The multimodality-UV space is shared across different user inputs, such as texts, images, and sketches, which enables various joint multimodal editing tasks. 3) Multimodality-UV Aligned Sampler learns to sample high-quality and diverse 3D humans from the diffusion prior. Extensive experiments validate FashionEngine's state-of-the-art performance for conditional generation/editing tasks. In addition, we present an interactive user interface for our FashionEngine that enables both conditional and unconditional generation tasks, and editing tasks including pose/view/shape control, text-, image-, and sketch-driven 3D human editing and 3D virtual try-on, in a unified framework. Our project page is at: https://taohuumd.github.io/projects/FashionEngine.

Paper Structure (40 sections, 4 equations, 14 figures, 3 tables)

Figures (14)

  • Figure 1: 3D human prior learning with StructLDM in two stages (S1 and S2). S1 learns an auto-decoder containing a set of structured embeddings $\mathcal{Z}$ corresponding to the human subjects in the training dataset. In S2, the embeddings $\mathcal{Z}$ are used to train a latent diffusion model in the semantic UV latent space.
  • Figure 2: Multimodality-UV Space (Sec. \ref{sec:method_space}). Based on the learned prior $\mathcal{Z}$, we construct a Multimodality-UV space including an Appearance-Canonical Space (App-Can, $\mathbb{A}^{can}$), an Appearance-UV Space (App-UV, $\mathbb{A}^{uv}$), a textual Semantics-UV Space (Sem-UV, $\mathbb{T}^{uv}$), and a Shape-UV Space ($\mathbb{S}^{uv}$).
  • Figure 3: Pipeline of multimodal generation (Sec. \ref{sec:method_generation}). (a) Text- and sketch-driven generation: given a text input $\mathbf{I}_T$ or a sketch input $\mathbf{I}^{uv}_S$ in the template UV space, the Text-UV Aligned Sampler and the Sketch-UV Aligned Sampler sample latents ($z^{*}_{T}$ and $z^{*}_{S}$, respectively) from the learned human prior $\mathcal{Z}$ (Sec. \ref{sec:method_prior}), which can be rendered into images by latent diffusion and rendering (Diff-Render, StructLDM). (b) Illustration of TextMatch and ShapeMatch: $\{z_k, z_i\}$ (①②) and $\{z_j, z_i\}$ (③②) are taken as the best matches to construct the target latents $z^{*}_{T}$ and $z^{*}_{S}$ for the text and sketch inputs, using the TextMatch and ShapeMatch algorithms respectively.
  • Figure 4: Text-, Sketch-, and Image-Driven Editing (Sec. \ref{sec:method_editing}). To edit a source human with latent $z$, FashionEngine allows users to type texts $\mathbf{I}_T$, draw sketches $S^{img}$, or provide a reference image with sketch masks for style transfer, and the target latents are constructed according to the user inputs. Note that a sketch can describe the length of sleeves in two different ways (e.g., ①③), or describe the geometry (e.g., ②).
  • Figure 5: Illustration of Sketch Parser. Sketch Parser supports four types of sketch input: sleeve or dress length drawn in two ways (①②), neckline shape (③), or a closed area (④). Sketches are transformed from the image space to the unified partial UV space (warped from a single view, as shown in Fig. \ref{fig:m2space}), and expanded into a mask in the full UV space based on a body topology prior.
  • ...and 9 more figures
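
The matching idea in Figure 3(b), where latents from different training subjects are selected as best matches and combined into one target latent, can be illustrated with a toy sketch. This is not the paper's TextMatch algorithm, whose details are not given in the captions above. It assumes, hypothetically, that each training latent comes with one text-feature vector per body part, that parts are given as binary masks over the UV grid, and that the best match is chosen by cosine similarity. The names `text_match`, `part_emb`, and `part_masks` are illustrative only.

```python
import numpy as np

def text_match(query, part_emb, latents, part_masks):
    """Toy part-wise matching in a UV-aligned latent space (illustrative only).

    query:      (P, D) one query text feature per body part (assumed input)
    part_emb:   (N, P, D) per-subject, per-part text features (assumed)
    latents:    (N, C, H, W) UV-aligned latents of N training subjects
    part_masks: (P, H, W) binary masks that partition the UV grid (assumed)
    """
    # Normalise so that the dot product below is cosine similarity.
    q = query / np.linalg.norm(query, axis=-1, keepdims=True)
    e = part_emb / np.linalg.norm(part_emb, axis=-1, keepdims=True)
    sim = np.einsum("pd,npd->np", q, e)   # (N, P) similarity per subject/part
    best = sim.argmax(axis=0)             # best-matching subject for each part
    # Compose the target latent by copying each part from its best match.
    target = np.zeros_like(latents[0])
    for p, n in enumerate(best):
        target += part_masks[p] * latents[n]  # (H, W) mask broadcasts over C
    return target, best

def demo():
    rng = np.random.default_rng(0)
    N, P, D, C, H, W = 3, 2, 4, 1, 2, 4
    part_emb = rng.normal(size=(N, P, D))
    # Latent of subject n is filled with the constant n, so copies are visible.
    latents = np.stack([np.full((C, H, W), float(n)) for n in range(N)])
    masks = np.zeros((P, H, W))
    masks[0, :, :2] = 1   # part 0 covers the left half of the UV grid
    masks[1, :, 2:] = 1   # part 1 covers the right half
    # Query part 0 like subject 2 and part 1 like subject 0.
    query = np.stack([part_emb[2, 0], part_emb[0, 1]])
    return text_match(query, part_emb, latents, masks)
```

Because every latent lives on the same UV grid, combining parts from different subjects reduces to masked copying, which is the property a shared UV-aligned space is meant to provide.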