PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Yuliang Xiu, Yufei Ye, Zhen Liu, Dimitrios Tzionas, Michael J. Black

TL;DR

PuzzleAvatar reconstructs personalized 3D avatars from casual OOTD photo albums without explicit body or camera pose estimation. It fine-tunes a foundation vision-language model to learn asset-level tokens for identity, garments, hairstyles, and accessories, and then assembles a 3D avatar compositionally via Score Distillation Sampling (SDS) on a DMTet representation. Central to the approach are PuzzleBooth, which creates tokenized puzzle pieces from unconstrained images, and PuzzleIOI, a new dataset with ground-truth 3D scans for benchmarking. Empirical results show competitive 3D reconstruction accuracy against TeCH and MVDreamBooth and superior 2D texture quality, and the method scales to album-sized inputs and supports downstream tasks such as virtual try-on. The work also discusses limitations, including hallucination risks and training cost, and outlines extensions that would make the method faster, more robust, and able to handle multiple subjects.
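
For readers unfamiliar with Score Distillation Sampling, the generic form of its gradient, as introduced in DreamFusion, is sketched below. This is background rather than the paper's own equation: the multi-view loss used by PuzzleAvatar may differ, and the symbols here ($\theta$ for the 3D parameters, $x$ for the rendered image, $x_t$ for its noised version at timestep $t$, $y$ for the text prompt, $w(t)$ for a timestep weighting, $\hat{\epsilon}_\phi$ for the diffusion model's noise prediction) are assumptions chosen for illustration.

$$
\nabla_\theta \mathcal{L}_\mathrm{SDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

Intuitively, the 3D representation is updated so that its renderings look more plausible to the text-conditioned diffusion model, without backpropagating through the diffusion model itself.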

Abstract

Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into (separate) learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only has high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also a unique scalability to album photos, and strong robustness. Our code and data are publicly available for research purposes at https://puzzleavatar.is.tue.mpg.de/

Paper Structure (15 sections, 5 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Image settings for avatar creation. Past work (left) requires images with full-body visibility, known camera calibration, or simple human poses. PuzzleAvatar operates on in-the-wild photos (right); it assumes a consistent outfit, hairstyle and accessories, but deals with unconstrained human poses, camera settings, lighting and background. Our PuzzleIOI dataset contains multi-view images with challenging crops paired with T-pose 3D ground truth.
  • Figure 2: Overview of PuzzleAvatar. The upper figure shows the two main stages: (1) PuzzleBooth (\ref{sec:puzzlebooth}), where the unconstrained photo collections are captioned and segmented to create personalized puzzle pieces, for training PuzzleBooth ($G_\text{puzzle}$), and (2) Create-3D-Avatar (\ref{sec:sds}), where the T-posed textured tetrahedral body mesh is optimized using a multi-view SDS loss, $\mathcal{L}_\mathrm{SDS}$ (\ref{eq:nsfd}). The bottom figure illustrates the training details of PuzzleBooth; the Text-Encoder and the UNet of T2I Diffuser (i.e., Stable Diffusion) are fine-tuned using the masked diffusion loss, $\mathcal{L}_\mathrm{rec}$ (\ref{eq:rec_loss}), cross-attention loss, $\mathcal{L}_\mathrm{attn}$ (\ref{eq:attention_loss}), and prior preservation loss, $\mathcal{L}_\mathrm{prior}$ (\ref{eq:prior_loss}). Components marked in light blue are trainable or optimizable.
  • Figure 3: Color-Normal Synthetic Prior. The descriptions are generated via GPT-4V, where the prompt of the RGB image starts with "a high-resolution DSLR colored image", while that of the normal image starts with "a detailed sculpture of". The zoomed-in head images are generated by appending "the headshot of".
  • Figure 4: Effectiveness of Synthetic Priors. All numbers report the performance gain (%). Full denotes training with color-normal pairs; RGB and Normal denote training with a single modality.
  • Figure 5: Failure Cases. Non-human artifacts are a common error for TeCH (see left part), whereas errors in PuzzleAvatar stem from hallucination and flawed DMTet modeling of thin structures. In the top-right case, the black pants showing through the white coat, while realistic, deviate from the original input. Such hallucinations cause PuzzleAvatar's results to differ from the ground truth, but the failures are not completely catastrophic (see right part).
  • ...and 6 more figures
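
The prompt conventions described in the Figure 3 caption can be illustrated with a small Python sketch. It only assembles strings from the three phrases quoted in the caption; the function name `build_prompt`, its parameters, the example description, and the way the phrases are joined (a comma after the RGB prefix, "the headshot of" placed directly before the description) are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical helper illustrating the prompt prefixes quoted in the Figure 3 caption.
# The three phrases come from the caption; how they are joined is an assumption.
RGB_PREFIX = "a high-resolution DSLR colored image"
NORMAL_PREFIX = "a detailed sculpture of"
HEAD_PHRASE = "the headshot of"


def build_prompt(description: str, modality: str = "rgb", headshot: bool = False) -> str:
    """Build a text prompt for an RGB or normal-map image of a described person."""
    # Optionally zoom in on the head by placing the headshot phrase before the description.
    subject = f"{HEAD_PHRASE} {description}" if headshot else description
    if modality == "rgb":
        # Assumed joining rule: prefix, comma, subject.
        return f"{RGB_PREFIX}, {subject}"
    if modality == "normal":
        # The normal-map prefix already ends in "of", so the subject follows directly.
        return f"{NORMAL_PREFIX} {subject}"
    raise ValueError(f"unknown modality: {modality!r}")


if __name__ == "__main__":
    desc = "a man wearing a white coat"  # placeholder description, not from the paper
    print(build_prompt(desc, "rgb"))
    print(build_prompt(desc, "normal", headshot=True))
```

Keeping the description fixed and swapping only the prefix is what pairs an RGB prompt with its normal-map counterpart in this sketch.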