FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

Rong Wang, Fabian Prada, Ziyan Wang, Zhongshi Jiang, Chengxiang Yin, Junxuan Li, Shunsuke Saito, Igor Santesteban, Javier Romero, Rohan Joshi, Hongdong Li, Jason Saragih, Yaser Sheikh

TL;DR

FRESA tackles animatable avatar reconstruction from few images by leveraging a universal clothed-human prior to jointly infer canonical geometry, skinning weights, and pose-dependent deformations in a feed-forward framework. It introduces a 3D canonicalization step and multi-frame aggregation to stabilize representations and preserve identity, trained end-to-end on a large dome dataset to achieve zero-shot generalization to casual photos. The method delivers superior geometry fidelity and animation quality compared to state-of-the-art baselines while offering fast inference suitable for consumer devices, enabling practical use in XR, virtual try-on, and telepresence.

Abstract

We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and clothing types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve the coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process to produce pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose multi-frame feature aggregation to robustly reduce artifacts introduced in canonicalization and fuse a plausible avatar that preserves person-specific identity. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstruction and animation than state-of-the-art methods, and generalizes directly to inputs from casually taken phone photos. The project page and code are available at https://github.com/rongakowang/FRESA.
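
To make the roles of the three predicted quantities concrete, the sketch below applies standard linear blend skinning (LBS) to a canonical mesh. It is a minimal illustration, not the authors' implementation: the function name `linear_blend_skinning` and the choice to add the pose-dependent displacements in canonical space before skinning are assumptions made here, and the toy joint transforms are invented for the example.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, transforms, displacements=None):
    """Pose canonical vertices with linear blend skinning (LBS).

    vertices:      (V, 3) canonical-space vertex positions
    weights:       (V, J) per-vertex skinning weights, each row sums to 1
    transforms:    (J, 4, 4) canonical-to-posed rigid transform per joint
    displacements: optional (V, 3) pose-dependent offsets; adding them in
                   canonical space before skinning is an assumption here
    """
    if displacements is not None:
        vertices = vertices + displacements
    # Blend the joint transforms per vertex: (V, J) x (J, 4, 4) -> (V, 4, 4)
    blended = np.einsum("vj,jab->vab", weights, transforms)
    # Homogeneous coordinates let one matrix carry rotation and translation
    homo = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)
    posed = np.einsum("vab,vb->va", blended, homo)
    return posed[:, :3]


# Toy example: two joints, the second translated by +1 along x.
T = np.stack([np.eye(4), np.eye(4)])
T[1, 0, 3] = 1.0
V = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
W = np.array([[1.0, 0.0], [0.5, 0.5]])  # second vertex is shared equally
posed = linear_blend_skinning(V, W, T)
```

With shared (template) skinning weights, `W` would be the same for every subject; predicting `W` per subject, as the abstract describes, lets vertices on loose clothing follow different joints than vertices on the body.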

Paper Structure

This paper contains 12 sections, 13 equations, 22 figures, 1 table.

Figures (22)

  • Figure 1: FRESA. We present a novel method to reconstruct personalized skinned avatars with realistic pose-dependent animation in a fully feed-forward manner, which generalizes to casually taken phone photos without any fine-tuning. We visualize the predicted skinning weights associated with the most important joints in (b) and colormaps of per-vertex displacement magnitudes during animation in (c).
  • Figure 2: Method Overview. We propose a novel feed-forward method to reconstruct personalized skinned avatars via a universal clothed human model. Specifically, given $N$ frames of posed human images $\{\mathbf{I}_i\}$ from front and back views, we first estimate their normal and segmentation images, and then unpose them for each frame and view to produce pixel-aligned initial conditions in a 3D canonicalization process (Section 3.1). Next, we aggregate the multi-frame references into a single bi-plane feature $\mathbf{B}$ that represents the subject identity. By sampling from this feature on a canonical tetrahedral grid, we jointly decode the personalized canonical avatar mesh $\mathbf{M}$, skinning weights $\mathbf{W}$, and pose-dependent vertex displacements $\Delta\mathbf{V}$ (Section 3.2). Finally, we adopt a multi-stage training process to train the model with posed-space ground truth and canonical-space regularization (Section 3.3).
  • Figure 3: Qualitative Comparison. When reposed to unseen poses, our method produces superior animation quality across challenging poses, body shapes, and clothing types: it reduces deformation artifacts, e.g. stretched triangles, and generates plausible wrinkles.
  • Figure 4: Method Generalizability. We show that the pretrained universal model can be applied directly to casually taken photos and to synthetic images from RenderPeople, which demonstrates its practical applicability. When applied to phone photos, we do not require perfect alignment of front and back views, and we use poses estimated from monocular images for canonicalization. More details are in the appendix.
  • Figure 5: Effects of multi-frame aggregation. Given a set of unposed normal frames from different poses on the left, each column on the right shows the canonical shape fused from the first $N$ frames. We observe that aggregating multiple frames produces more plausible canonical shapes by correcting unposing artifacts, e.g. on skirts and hair, while preserving person-specific details.
  • ...and 17 more figures
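
The unposing step mentioned in the Figure 2 caption can be pictured as inverting linear blend skinning: each posed point is mapped back to canonical space through the inverse of its blended joint transform. The sketch below is only an illustration under stated assumptions, not the paper's canonicalization procedure: the helper names (`blend_transforms`, `to_homogeneous`, `unpose_points`) are hypothetical, and the skinning weights and joint transforms are assumed to be known for every posed point, whereas the paper works from estimated normal images and poses.

```python
import numpy as np


def blend_transforms(weights, transforms):
    """Blend per-joint 4x4 transforms per point: (P, J) x (J, 4, 4) -> (P, 4, 4)."""
    return np.einsum("pj,jab->pab", weights, transforms)


def to_homogeneous(points):
    """Append a column of ones so 4x4 transforms apply directly."""
    return np.concatenate([points, np.ones((len(points), 1))], axis=1)


def unpose_points(posed_points, weights, transforms):
    """Map posed points to canonical space by inverting the blended transform."""
    inverse = np.linalg.inv(blend_transforms(weights, transforms))
    canonical = np.einsum("pab,pb->pa", inverse, to_homogeneous(posed_points))
    return canonical[:, :3]


# Round-trip check on toy data: pose three canonical points, then unpose them.
rot = np.eye(4)
rot[:2, :2] = [[0.0, -1.0], [1.0, 0.0]]  # 90-degree rotation about z
rot[:3, 3] = [0.2, 0.0, 0.0]             # plus a small translation
transforms = np.stack([np.eye(4), rot])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
canonical = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.5]])
posed = np.einsum(
    "pab,pb->pa", blend_transforms(weights, transforms), to_homogeneous(canonical)
)[:, :3]
recovered = unpose_points(posed, weights, transforms)
```

In practice the weights used for unposing are themselves uncertain, which is one way to read the caption's remark about unposing artifacts on skirts and hair: a wrong weight assignment sends a point to the wrong canonical location, and aggregating several frames gives the model a chance to correct it.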