GALA: Generating Animatable Layered Assets from a Single Scan

Taeksoo Kim, Byungjun Kim, Shunsuke Saito, Hanbyul Joo

TL;DR

GALA addresses the challenge of turning a single static scan of a clothed human into animatable, multi-layered 3D assets suitable for garment transfer and avatar reanimation. The method decomposes geometry and texture into two layers in shared canonical spaces, using Deep Marching Tetrahedra for geometry and a pose-guided SDS loss, driven by a pretrained 2D diffusion prior, to inpaint occluded regions. By coupling reconstruction, segmentation, and refinement losses with diffusion-based guidance, GALA achieves robust canonicalization, layered decomposition, and high-quality compositing for novel identities and poses. The authors establish an evaluation protocol and demonstrate superior performance over baselines in both qualitative and quantitative analyses, offering a practical pipeline for automatic asset creation from a single scan.
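For readers unfamiliar with SDS, the generic score-distillation gradient can be written as below. This is the standard form from the text-to-3D literature, not necessarily the exact loss used in GALA; the pose condition p is added here as an assumption to suggest how "pose-guided" conditioning could enter, and all symbols are illustrative.

```latex
% Generic SDS gradient (illustrative; the paper's exact formulation may differ).
% \theta : parameters of the 3D representation, x = g(\theta) a rendered image
% x_t    : x after adding noise \epsilon at diffusion timestep t
% y      : text prompt, p : pose condition (assumed), w(t) : timestep weighting
\nabla_{\theta} \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\, y,\, p,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```

Intuitively, the pretrained diffusion model's noise prediction pulls renders of the 3D asset toward images it considers likely under the given conditions, which is what allows occluded regions to be filled in plausibly.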

Abstract

We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. Existing reconstruction approaches often treat clothed humans as a single layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover, even with successful decomposition, meshes are not normalized in terms of poses and body shapes, which prevents coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as a geometry and appearance prior for humans and other assets. We first separate the input mesh using the 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete inpainting of the high-fidelity 3D geometry, we also apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition to novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions.
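The abstract mentions extracting a 3D surface segmentation from multi-view 2D segmentations. One simple way to picture this step is per-vertex majority voting over the views in which a vertex is visible. The sketch below is only an illustration under that assumption; the function name `lift_labels`, the input layout, and the voting rule are hypothetical and are not taken from the paper.

```python
from collections import Counter

def lift_labels(num_vertices, views):
    """Assign each mesh vertex the label it receives most often across views.

    `views` is a list with one dict per rendered view, mapping the index of a
    vertex visible in that view to the 2D segmentation label at its projection.
    Vertices never seen in any view get None. (Hypothetical data layout.)
    """
    votes = [Counter() for _ in range(num_vertices)]
    for view in views:
        for vertex_id, label in view.items():
            votes[vertex_id][label] += 1
    # Pick the most frequent label per vertex, or None if it was never visible.
    return [c.most_common(1)[0][0] if c else None for c in votes]

# Three toy views of a four-vertex mesh; vertex 3 is occluded everywhere.
views = [
    {0: "object", 1: "human"},
    {0: "object", 1: "object", 2: "human"},
    {1: "human", 2: "human"},
]
labels = lift_labels(4, views)
print(labels)
```

Vertices left unlabeled (here, vertex 3) correspond to the occluded regions that the method has to synthesize rather than segment.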

Paper Structure

This paper contains 43 sections, 19 equations, 21 figures, 4 tables.

Figures (21)

  • Figure 1: GALA. Given a single-layer 3D mesh of a clothed human (left), our approach enables Generation of Animatable Layered Assets for 3D garment transfer and avatar customization in any pose by decomposing and inpainting the geometry and texture of each layer with a pretrained 2D diffusion model in a canonical space.
  • Figure 2: Overview. GALA learns an object layer and the remaining human layer in a canonical space using DMTet (Shen et al., 2021). The canonical space (orange) and the original posed space (purple) are differentiably associated through linear blend skinning (LBS). Our novel pose-guided SDS loss (right) guides the decomposition and inpainting in both the canonical and posed spaces. We also retain the fidelity of visible regions via reconstruction and segmentation losses (bottom left).
  • Figure 3: Decomposition and Synthesis. We decompose humans and objects using 3D segmentation lifted from 2D and synthesize plausible geometry of the missing regions using pose-guided SDS.
  • Figure 4: Texture Generation. Applying SDS loss in canonical space generates texture for regions occluded by objects along with self-occluded regions.
  • Figure 5: Decomposition and Canonicalization. In each set, we show the decomposition and canonicalization results of the leftmost sample.
  • ...and 16 more figures
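
The caption of Figure 2 notes that the canonical and posed spaces are linked by linear blend skinning (LBS). As a reminder of what LBS computes, here is a minimal pure-Python sketch: each posed vertex is a weighted sum of the vertex transformed by every bone. The function names, the 3x4 transform layout, and the toy data are assumptions for illustration; they do not reflect the paper's implementation.

```python
def apply_transform(T, v):
    """Apply a 3x4 rigid transform [R | t] (nested lists) to a 3D point."""
    return [sum(T[r][c] * v[c] for c in range(3)) + T[r][3] for r in range(3)]

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Pose canonical vertices: v' = sum_k w_k * (R_k v + t_k).

    vertices:        list of [x, y, z] points in canonical space
    weights:         per-vertex list of skinning weights, one per bone
    bone_transforms: list of 3x4 transforms, one per bone
    """
    posed = []
    for v, w in zip(vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for w_k, T_k in zip(w, bone_transforms):
            moved = apply_transform(T_k, v)
            for axis in range(3):
                blended[axis] += w_k * moved[axis]
        posed.append(blended)
    return posed

# Two toy bones: one stays put, the other translates by +2 along x.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
SHIFT_X = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0]]

posed = linear_blend_skinning(
    vertices=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    weights=[[0.5, 0.5], [1.0, 0.0]],
    bone_transforms=[IDENTITY, SHIFT_X],
)
print(posed)
```

Because the result is a linear combination of the inputs, gradients can flow from losses in the posed space back to geometry defined in the canonical space, which is what "differentiably associated" refers to.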