PFAvatar: Pose-Fusion 3D Personalized Avatar Reconstruction from Real-World Outfit-of-the-Day Photos

Dianbing Xi, Guoyuan An, Jingsen Zhu, Zhijian Liu, Yuan Liu, Ruiyuan Zhang, Jiayuan Lu, Yuchi Huo, Rui Wang

TL;DR

PFAvatar tackles the challenge of reconstructing textured 3D avatars from unconstrained Outfit of the Day photos by combining a pose-aware diffusion model trained with ControlNet-based pose priors and a CPPL loss, with a NeRF-based avatar distilled via canonical SMPL-X sampling and Multi-Resolution 3D-SDS. This two-stage pipeline avoids asset decomposition, preserves identity, and handles occlusions while delivering rapid personalization (~5 minutes). The NeRF avatar uses Instant-NGP and 3D-aware prompting to maintain high-frequency details and 3D consistency, aided by a Local Geometry Constraint to stabilize local structures. Compared with state-of-the-art methods, PFAvatar achieves higher reconstruction fidelity, robustness to truncations, and better downstream utility for virtual try-on, animation, and video reenactment.

Abstract

We propose PFAvatar (Pose-Fusion Avatar), a new method that reconstructs high-quality 3D avatars from Outfit of the Day (OOTD) photos, which exhibit diverse poses, occlusions, and complex backgrounds. Our method consists of two stages: (1) fine-tuning a pose-aware diffusion model from few-shot OOTD examples and (2) distilling a 3D avatar represented by a neural radiance field (NeRF). In the first stage, unlike previous methods that segment images into assets (e.g., garments, accessories) for 3D assembly, which is prone to inconsistency, we avoid decomposition and directly model the full-body appearance. By integrating a pre-trained ControlNet for pose estimation and a novel Condition Prior Preservation Loss (CPPL), our method enables end-to-end learning of fine details while mitigating language drift in few-shot training. Our method completes personalization in just 5 minutes, achieving a 48x speed-up compared to previous approaches. In the second stage, we introduce a NeRF-based avatar representation optimized by canonical SMPL-X space sampling and Multi-Resolution 3D-SDS. Compared to mesh-based representations that suffer from resolution-dependent discretization and erroneous occluded geometry, our continuous radiance field can preserve high-frequency textures (e.g., hair) and handle occlusions correctly through transmittance. Experiments demonstrate that PFAvatar outperforms state-of-the-art methods in terms of reconstruction fidelity, detail preservation, and robustness to occlusions/truncations, advancing practical 3D avatar generation from real-world OOTD albums. In addition, the reconstructed 3D avatar supports downstream applications such as virtual try-on, animation, and human video reenactment, further demonstrating the versatility and practical value of our approach.
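
The abstract's remark that a radiance field handles occlusions "through transmittance" can be illustrated with the standard NeRF compositing rule, where each sample along a ray is weighted by how much light survives the samples in front of it. The sketch below is a minimal pure-Python illustration of that general rule, not the paper's implementation; the function name `render_weights` and the toy ray (density values and step sizes) are assumptions made for this example.

```python
import math

def render_weights(densities, deltas):
    """Standard volume-rendering weights: w_i = T_i * (1 - exp(-sigma_i * delta_i)).

    T_i is the transmittance, i.e. the fraction of light that reaches sample i
    without being absorbed by the samples in front of it.
    """
    weights = []
    transmittance = 1.0  # nothing has blocked the ray before the first sample
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha            # light remaining for later samples
    return weights

# Hypothetical ray (values are illustrative only): empty space, a dense
# occluder, empty space, then a second dense surface behind the occluder.
densities = [0.0, 50.0, 0.0, 50.0]
deltas = [0.1, 0.1, 0.1, 0.1]
weights = render_weights(densities, deltas)
```

In this toy ray the occluder absorbs almost all of the light, so the surface behind it receives a weight below 0.01 and contributes almost nothing to the rendered pixel.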


Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Using the "Outfit of the Day" (OOTD) photos from a personal collection (shown in the upper left), our PFAvatar reconstructs a personalized and fully textured 3D NeRF avatar (depicted in the middle). These OOTD photos can vary widely in body pose, scale, camera angle, and framing, and often contain partial occlusions or significant truncation. PFAvatar is designed to handle such variability robustly, enabling a range of downstream tasks. These include virtual try-on through text-guided editing, 3D animation, facial animation, and human video reenactment, all while preserving the subject's identity and unique characteristics.
  • Figure 2: Overview of our PFAvatar pipeline. (1) The top left illustrates the overall flow of our framework, which consists of two main stages: ControlBooth and BoothAvatar. (2) The bottom left displays the training details of the ControlBooth stage. In this stage, the input data consists of three parts: images, pose conditioning, and captions. These are used to fine-tune a Pose-Aware Diffusion Model $\mathcal{M}_\text{b}$, where the text encoder and the U-Net are trained using the reconstruction diffusion loss $\mathcal{L}_{\text{rec}}$ (\ref{eq:controlbooth_rec}) and the condition-based prior preservation loss $\mathcal{L}_{\text{cppl}}$ (\ref{eq:controlbooth_cppl}). (3) The right section shows the details of the BoothAvatar stage. In this stage, the avatar is represented as an A-posed canonical avatar. The model $\mathcal{M}_\text{b}$ obtained from the previous stage is used to guide this reconstruction process. Using multi-resolution $\mathcal{L}_{\text{3D-SDS}}$ (\ref{eq:3d-sds}), we optimize a NeRF represented by Instant-NGP, with an additional loss $\mathcal{L}_{\text{geo}}$ (\ref{eq:geo}) to stabilize local structures during SDS optimization.
  • Figure 3: Balancing diversity and control with Condition Prior-Preservation Loss (CPPL). Using the fine-tuning strategy of Naive DreamBooth (Row 1) to generate images with new poses may introduce color discrepancies, significantly reducing thematic consistency. Training with only $\mathcal{L}_{\text{rec}}$ (\ref{eq:controlbooth_rec}) may lead to overfitting on the context of the input image and the subject's appearance (e.g., pose). CPPL (Row 3) acts as a regularizer, mitigating overfitting while encouraging diversity and maintaining control.
  • Figure 4: Qualitative Comparison I: Custom Dataset. Visual results on three distinct subjects compare PuzzleAvatar, AvatarBooth, and our method (PFAvatar). Our approach consistently preserves finer details and maintains structural coherence across multiple views, demonstrating clear superiority over the baselines. This highlights PFAvatar's robustness in achieving high-quality, consistent reconstructions under challenging conditions.
  • Figure 5: Qualitative results of personalized diffusion models. Compared to other methods, our Pose-Aware Diffusion Model achieves superior subject consistency while enabling precise control over character poses. The pose illustration in the lower-left corner represents the input control pose used for conditional generation. ✓ indicates support for pose control, while ✗ denotes lack of support.
  • ...and 2 more figures
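
The Figure 2 and Figure 3 captions describe the ControlBooth objective as a reconstruction loss $\mathcal{L}_{\text{rec}}$ combined with a prior-preservation regularizer $\mathcal{L}_{\text{cppl}}$. The exact equations are not reproduced in this excerpt, so the following is only a sketch of the general DreamBooth-style pattern of adding a weighted prior term to a reconstruction term. The mean-squared-error form, the function name `controlbooth_style_loss`, and the `prior_weight` parameter are all assumptions for illustration, not the paper's definitions.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def controlbooth_style_loss(pred_subject, target_subject,
                            pred_prior, target_prior, prior_weight=1.0):
    """Hypothetical combined objective: l_rec + prior_weight * l_cppl.

    pred_subject / target_subject: model output and target on the subject's photos.
    pred_prior / target_prior: model output and target on prior (regularization)
    samples, which in the paper's setting would also carry a pose condition.
    """
    l_rec = mse(pred_subject, target_subject)   # fit the few-shot subject photos
    l_cppl = mse(pred_prior, target_prior)      # stay close to the pretrained prior
    return l_rec + prior_weight * l_cppl, l_rec, l_cppl

# Toy values chosen so the arithmetic is easy to follow by hand.
total, l_rec, l_cppl = controlbooth_style_loss(
    pred_subject=[0.5, -0.5], target_subject=[0.0, 0.0],
    pred_prior=[1.0, 1.0], target_prior=[1.0, 0.0],
    prior_weight=0.5,
)
```

With these toy inputs the reconstruction term is 0.25 and the prior term is 0.5, so a weight of 0.5 gives a total of 0.5. Increasing `prior_weight` pulls the fine-tuned model toward the pretrained prior, which is the regularizing role the Figure 3 caption attributes to CPPL.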