BAG: Body-Aligned 3D Wearable Asset Generation

Zhongjin Luo, Yang Li, Mingrui Zhang, Senbo Wang, Han Yan, Xibin Song, Taizhang Shang, Wei Mao, Hongdong Li, Xiaoguang Han, Pan Ji

TL;DR

BAG addresses automatic generation of body-aligned 3D wearable assets by conditioning diffusion-based 3D generation on body shape and pose. It introduces a body-conditioned multiview diffusion module trained on a large 3D asset corpus and guided by a ControlNet on body coordinate maps, followed by a native 3D diffusion model to synthesize the asset, and a Sim(3) alignment plus physics-based penetration solver to fit the asset onto a target body. The approach achieves improved prompt-following, diversity, and geometry quality relative to single-view garment reconstruction baselines, demonstrated through quantitative metrics and qualitative results, across multiple input acquisition methods. This enables automatic, scalable dressing of 3D avatars with high geometric fidelity and lays groundwork for broader, automated 3D garment generation with future improvements in multi-layer garments and topological robustness.

Abstract

While recent advancements have shown remarkable progress in general 3D shape generation models, the challenge of leveraging these approaches to automatically generate wearable 3D assets remains unexplored. To this end, we present BAG, a Body-aligned Asset Generation method that outputs 3D wearable assets that can be automatically dressed on given 3D human bodies. This is achieved by controlling the 3D generation process using human body shape and pose information. Specifically, we first build a general single-image-to-consistent-multiview image diffusion model and train it on the large Objaverse dataset to achieve diversity and generalizability. Then we train a ControlNet to guide the multiview generator to produce body-aligned multiview images. The control signal uses multiview 2D projections of the target human body, where pixel values represent the XYZ coordinates of the body surface in a canonical space. The body-conditioned multiview diffusion generates body-aligned multiview images, which are then fed into a native 3D diffusion model to produce the 3D shape of the asset. Finally, by recovering the similarity transformation using multiview silhouette supervision and addressing asset-body penetration with physics simulators, the 3D asset can be accurately fitted onto the target human body. Experimental results demonstrate significant advantages over existing methods in terms of image prompt-following capability, shape diversity, and shape quality. Our project page is available at https://bag-3d.github.io/.
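
The similarity transformation mentioned in the abstract combines a uniform scale, a rotation, and a translation. The minimal sketch below only shows how such a transform, once estimated, would be applied to asset vertices; it does not reproduce the silhouette-based optimization that recovers it. The function names `apply_sim3` and `rotation_about_z` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def apply_sim3(points, scale, rotation, translation):
    """Map each 3D point x to scale * R @ x + t (a similarity transform)."""
    transformed = []
    for x in points:
        # Rotate: multiply the 3x3 matrix by the point.
        rotated = [sum(rotation[i][j] * x[j] for j in range(3)) for i in range(3)]
        # Scale uniformly, then translate.
        transformed.append([scale * rotated[i] + translation[i] for i in range(3)])
    return transformed

def rotation_about_z(angle_rad):
    """Hypothetical helper: rotation matrix for a turn about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Illustrative values only: two vertices, scale 2, a quarter turn, a shift along z.
asset_vertices = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
fitted = apply_sim3(asset_vertices, 2.0, rotation_about_z(math.pi / 2), [0.0, 0.0, 1.0])
```

A similarity transform has seven degrees of freedom (one for scale, three for rotation, three for translation), so it can reposition and resize a generated asset without distorting its shape.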

Paper Structure

This paper contains 12 sections, 2 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Method Pipeline. Given an input image and a target body, we employ body-conditioned image generation to produce body-aligned, consistent four-view orthographic images (see Sec. \ref{sec:body_0123}). The four-view images are then fed into a native 3D diffusion model to obtain the asset shape. The similarity transformation (Sim(3)) of the generated asset is estimated through silhouette-based projection optimization (see Sec. \ref{sec:sim3_estimation}). Finally, after resolving body-asset penetration, the Sim(3)-transformed asset is fitted onto the human body (see Sec. \ref{sec:penetration_solve}). The means of obtaining the input body and image pair are detailed in Sec. \ref{sec:input_aquisition}.
  • Figure 2: Canonical body space (left) and examples from the body-aligned 3D asset dataset (right). The color on the body surface is obtained by scaling the canonical XYZ values to the range [0, 255].
  • Figure 3: Penetration Handling. Despite the application of the Sim(3) transformation, penetrations between the asset and the body persist, as illustrated in the Initial Alignment. To address this, a proxy mesh is employed, which retains the essential geometry of the visual mesh and serves as a representative for cloth simulation. The Final Alignment showcases the penetration-free state of the asset and body post-simulation.
  • Figure 4: Four methods to acquire input body and image pairs. a) SMPLX Fitting. b) Sketch-Based Modeling. c) Virtual Try-on. d) Manual Image Assembly.
  • Figure 5: Qualitative asset shape generation results.
  • ...and 5 more figures
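
The color encoding described in the Figure 2 caption, where canonical XYZ values are scaled to [0, 255], can be sketched as a per-axis linear rescaling. This is a minimal illustration under stated assumptions: the canonical bounds `BBOX_MIN` and `BBOX_MAX` (a cube from -1 to 1) and the function names are hypothetical, since the bounds of the canonical space are not given here.

```python
# Assumed canonical bounds; the actual extent of the canonical body space is not stated.
BBOX_MIN = (-1.0, -1.0, -1.0)
BBOX_MAX = (1.0, 1.0, 1.0)

def encode_xyz_as_rgb(points, bbox_min, bbox_max):
    """Linearly rescale each coordinate from [bbox_min, bbox_max] to an integer in [0, 255]."""
    colors = []
    for point in points:
        rgb = []
        for value, lo, hi in zip(point, bbox_min, bbox_max):
            normalized = (value - lo) / (hi - lo)
            normalized = min(max(normalized, 0.0), 1.0)  # clamp points outside the box
            rgb.append(round(normalized * 255))
        colors.append(tuple(rgb))
    return colors

def decode_rgb_as_xyz(colors, bbox_min, bbox_max):
    """Invert the encoding, up to 8-bit quantization error."""
    points = []
    for color in colors:
        points.append(tuple(lo + (c / 255) * (hi - lo)
                            for c, lo, hi in zip(color, bbox_min, bbox_max)))
    return points

# Illustrative surface points in the assumed canonical space.
surface_points = [(-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), (0.3, -0.5, 0.9)]
pixel_colors = encode_xyz_as_rgb(surface_points, BBOX_MIN, BBOX_MAX)
recovered = decode_rgb_as_xyz(pixel_colors, BBOX_MIN, BBOX_MAX)
```

Because each pixel stores a position rather than an appearance, a rendered view of the body in this encoding tells the image generator where each visible surface point lies in the canonical space, at the cost of quantization to 256 levels per axis.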