CanonicalFusion: Generating Drivable 3D Human Avatars from Multiple Images

Jisu Shin; Junmyeong Lee; Seongmin Lee; Min-Gyu Park; Ju-Mi Kang; Ju Hong Yoon; Hae-Gon Jeon

TL;DR

CanonicalFusion generates drivable 3D human avatars from multiple images by predicting dual-sided depth maps and compressed LBS weight maps, then canonicalizing the reconstructed mesh directly into a shared canonical space. A shared-encoder, dual-decoder network produces the depth and LBS maps, with the LBS weights compressed to a 3-dimensional vector. The resulting canonical mesh is refined through forward skinning-based differentiable rendering across multiple views. Key contributions include the compressed LBS weight representation, a forward skinning rendering framework for multi-image fusion, and a robust canonicalization pipeline that fills holes via signed distance integration and Flexicubes reparameterization. The method reports improved accuracy over state-of-the-art approaches on public datasets and shows practical applicability to in-the-wild images, with open-source code available for replication and extension.
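The compressed LBS representation can be pictured with a short sketch: a small MLP maps each 3-dimensional code to a full vector of per-joint skinning weights. Everything below is an illustrative assumption, not the authors' decoder: the layer sizes, the random (untrained) parameters, the softmax normalization, the joint count of 55 (the SMPL-X convention), and the function name decode_skinning_weights are all hypothetical.

```python
import numpy as np

NUM_JOINTS = 55  # assumption: SMPL-X style joint count


def decode_skinning_weights(codes, w1, b1, w2, b2):
    """Map (N, 3) compressed codes to (N, NUM_JOINTS) skinning weights.

    Hypothetical two-layer MLP with a softmax output, so every row is
    non-negative and sums to one, as skinning weights must.
    """
    hidden = np.maximum(codes @ w1 + b1, 0.0)      # ReLU hidden layer
    logits = hidden @ w2 + b2
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Random stand-in parameters; a real decoder would be pre-trained.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(3, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, NUM_JOINTS)), np.zeros(NUM_JOINTS)

codes = rng.uniform(size=(4, 3))                   # four compressed codes
weights = decode_skinning_weights(codes, w1, b1, w2, b2)
```

Predicting a 3-channel map and decoding it afterwards keeps the image network's output as small as an RGB image, while the decoder restores the full per-joint weights needed for skinning.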

Abstract

We present a novel framework for reconstructing animatable human avatars from multiple images, termed CanonicalFusion. Our central concept involves integrating individual reconstruction results into the canonical space. To be specific, we first predict Linear Blend Skinning (LBS) weight maps and depth maps using a shared-encoder-dual-decoder network, enabling direct canonicalization of the 3D mesh from the predicted depth maps. Here, instead of predicting high-dimensional skinning weights, we infer compressed skinning weights, i.e., a 3-dimensional vector, with the aid of pre-trained MLP networks. We also introduce a forward skinning-based differentiable rendering scheme to merge the reconstructed results from multiple images. This scheme refines the initial mesh by reposing the canonical mesh via forward skinning and by minimizing photometric and geometric errors between the rendered and the predicted results. Our optimization scheme considers the position and color of vertices as well as the joint angles for each image, thereby mitigating the negative effects of pose errors. We conduct extensive experiments to demonstrate the effectiveness of our method and compare CanonicalFusion with state-of-the-art methods. Our source code is available at https://github.com/jsshin98/CanonicalFusion.
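Forward skinning and canonicalization, as referred to in the abstract, can be illustrated with the standard LBS formula: each vertex is moved by a weighted blend of per-joint rigid transforms, and canonicalization applies the inverse of that blended transform. This is a minimal NumPy sketch under stated assumptions, not the released implementation: the two-joint toy skeleton, the helper rigid_z, and the choice to canonicalize by inverting the blended matrix per vertex are all illustrative.

```python
import numpy as np


def rigid_z(theta, translation):
    """4x4 rigid transform: rotation about z by theta, then translation."""
    c, s = np.cos(theta), np.sin(theta)
    mat = np.eye(4)
    mat[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    mat[:3, 3] = translation
    return mat


def blend_transforms(weights, joint_transforms):
    """Blend (J, 4, 4) joint transforms with (N, J) weights -> (N, 4, 4)."""
    return np.einsum("nj,jab->nab", weights, joint_transforms)


def to_homogeneous(points):
    return np.concatenate([points, np.ones((len(points), 1))], axis=1)


def forward_skin(canonical_vertices, weights, joint_transforms):
    """Pose canonical vertices with linear blend skinning."""
    blended = blend_transforms(weights, joint_transforms)
    posed = np.einsum("nab,nb->na", blended, to_homogeneous(canonical_vertices))
    return posed[:, :3]


def canonicalize(posed_vertices, weights, joint_transforms):
    """Undo skinning by inverting each vertex's blended transform."""
    inverse = np.linalg.inv(blend_transforms(weights, joint_transforms))
    canonical = np.einsum("nab,nb->na", inverse, to_homogeneous(posed_vertices))
    return canonical[:, :3]


# Toy skeleton: joint 0 stays fixed, joint 1 rotates 90 degrees and shifts.
joint_transforms = np.stack([np.eye(4), rigid_z(np.pi / 2, [0.0, 1.0, 0.0])])
vertices = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.2]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])

posed = forward_skin(vertices, weights, joint_transforms)
recovered = canonicalize(posed, weights, joint_transforms)
```

In this toy case the round trip recovers the canonical vertices exactly because the same weights and joint transforms are used in both directions. With predicted weights and estimated poses the two would not match perfectly, which is the kind of error the multi-image refinement described above is meant to reduce.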

Paper Structure (16 sections, 13 equations, 9 figures, 3 tables)

Introduction
Related Works
Clothed Human Reconstruction
Drivable Human Avatar Generation
The Proposed Method
Preliminaries on Linear Blend Skinning
Joint Depth and LBS Prediction
Compact Representation of Skinning Weights
Objective Function
Texture Prediction
Canonical Mesh Reconstruction
Incorporating Multiple Images into Forward Skinning-based Differentiable Rendering
Experimental Results
Quantitative and Qualitative Evaluations
Ablation Study
...and 1 more section

Figures (9)

Figure 1: Our framework, CanonicalFusion, generates a drivable avatar from multiple images.
Figure 2: An overview of our framework, CanonicalFusion. It takes an RGB image and depth maps generated from SMPL-X and estimates dual-sided depth maps and a 3-dimensional LBS weight map. The original skinning weights are decoded from the compressed LBS weight maps and used to generate a canonicalized mesh. To further improve quality, the canonical mesh is refined by integrating multiple frames with forward skinning-based differentiable rendering.
Figure 3: (a) UV map and SMPL mesh colored with encoded skinning weights. (b) Reposed mesh using decoded LBS weight from the pretrained decoder.
Figure 4: Comparison of normal maps. The first two rows are from the TH3.0 test data, and the latter two are from the RP dataset.
Figure 5: Comparison of results between SCANimate and our method. We used the same SMPL pose parameters for SCANimate and ours. Five and fifteen scans were used to canonicalize the meshes.
...and 4 more figures
