Bringing Your Portrait to 3D Presence

Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu

TL;DR

This work enables animatable 3D avatar reconstruction from a single portrait across head, half-body, and full-body inputs by introducing a Dual-UV, geometry-aligned feature framework, a factorized synthetic data manifold, and a robust proxy-mesh tracker. Trained solely on synthetic data, it achieves state-of-the-art results for head and upper-body reconstruction and competitive performance for full-body scenarios, with strong generalization to in-the-wild images. The method emphasizes data scalability and stability through a mask-based reconstruction pipeline, a hybrid data pipeline with realism regularization, and a multi-estimator proxy-mesh tracker, enabling applications such as editing and multi-view fusion. Overall, the approach advances single-image 3D avatar reconstruction by combining geometry-consistent UV representations, diverse synthetic data, and robust tracking to handle varying input completeness.

Abstract

We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Paper Structure

This paper contains 46 sections, 22 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Our method uses a dual-UV formulation to represent 3D avatars, enabling reconstruction from full-body, half-body, and headshot portraits while capturing off-body textures. Trained entirely on synthetic data, it generalizes effectively to in-the-wild images.
  • Figure 2: Reconstruction Pipeline. Given a reference image and its tracked proxy mesh, dense features from a frozen encoder are sampled along visible rays and scattered into canonical UV space to form the Core-UV map, while an offset shell captures off-surface regions such as hair and clothing. The Core-UV and Shell-UV tokens are fused and decoded by a lightweight transformer to reconstruct UV-space Gaussian attributes, which are then rigged to a target mesh and rendered from arbitrary viewpoints.
  • Figure 3: Data Curation. We build a hybrid dataset by combining geometry-anchored 3D rendering with semantics-driven generative synthesis. The synthetic rendering branch offers geometry-consistent multi-view supervision through procedural sampling of identity, pose, appearance, illumination, and cameras. The generative branch constructs a factorized appearance manifold by decomposing scene attributes, applying LLM-based filmic refinement, generating photorealistic sequences, and completing each sample with side/back views for weakly correlated augmentation.
  • Figure 4: Reenactment Results. Trained solely on upper-body data, our method generalizes well to head and full-body inputs.
  • Figure 5: Novel View Synthesis. Our method generates multi-view human renderings from a single reference image, showing comparatively more consistent appearance, especially in the head and upper-body regions.
  • ...and 13 more figures
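
The scatter step described in the Figure 2 caption (visible image features written into canonical UV space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `scatter_to_uv`, the nearest-texel assignment, and the averaging of features that land on the same texel are all assumptions made for illustration. It also assumes the per-pixel UV coordinates and the visibility mask have already been obtained by rasterizing the tracked proxy mesh.

```python
import numpy as np

def scatter_to_uv(features, uv, visible, uv_size=4):
    """Scatter per-pixel features into a UV grid (hypothetical helper).

    features: (H, W, C) image-space feature map
    uv:       (H, W, 2) per-pixel UV coordinates in [0, 1), assumed to come
              from rasterizing the proxy mesh
    visible:  (H, W) boolean mask of pixels that hit the mesh
    Returns the (uv_size, uv_size, C) UV feature map and a boolean mask of
    texels that received at least one feature.
    """
    c = features.shape[-1]
    feats = features[visible]    # (N, C) features of visible pixels
    coords = uv[visible]         # (N, 2) their UV coordinates

    # Nearest-texel assignment (an assumption; other schemes are possible).
    cols = np.clip((coords[:, 0] * uv_size).astype(int), 0, uv_size - 1)
    rows = np.clip((coords[:, 1] * uv_size).astype(int), 0, uv_size - 1)
    flat = rows * uv_size + cols

    # Unbuffered accumulation so pixels sharing a texel are all counted.
    uv_sum = np.zeros((uv_size * uv_size, c))
    uv_cnt = np.zeros(uv_size * uv_size)
    np.add.at(uv_sum, flat, feats)
    np.add.at(uv_cnt, flat, 1.0)

    # Average where at least one pixel landed; empty texels stay zero.
    filled = uv_cnt > 0
    uv_sum[filled] /= uv_cnt[filled, None]
    return uv_sum.reshape(uv_size, uv_size, c), filled.reshape(uv_size, uv_size)
```

Because the output grid is indexed by UV coordinates rather than pixel position, the same surface point lands on the same texel regardless of pose or framing, which is the property the Dual-UV representation relies on. Texels left unfilled correspond to regions not visible in the reference image.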