UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

Zeyu Cai, Ziyang Li, Xiaoben Li, Boqian Li, Zeyu Wang, Zhenyu Zhang, Yuliang Xiu

Abstract

We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs, which may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation (PCFA) module that selectively fuses information from multiple reference images with respect to target poses, so that adding more observations improves identity preservation while the memory footprint stays nearly constant. We also introduce a perceiver-based multi-reference shape predictor, removing the need for pre-captured body templates. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer 15% lower, P2S 18% lower on PuzzleIOI) and texture fidelity (PSNR 21% higher, LPIPS 46% lower on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (it supports arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task. Project Page: https://zcai0612.github.io/UP2You
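
As a rough illustration of the geometric metric named in the abstract, the sketch below computes a symmetric Chamfer distance between two small point sets and a relative change between two scores. The abstract does not state the exact Chamfer variant (squared or unsquared, sum or mean), so the mean unsquared form used here is an assumption. Reading the reported percentages as relative changes against a baseline is also an assumption. The function names `chamfer_distance` and `relative_reduction` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance (assumed variant: mean of unsquared distances)."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is lower than `baseline` (e.g. 0.15 for 15%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - ours) / baseline


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(pred, gt))
    print(relative_reduction(2.0, 1.7))
```

This brute-force version is quadratic in the number of points and is only meant to make the definition concrete. Practical evaluations on dense scans typically use a spatial index for the nearest-neighbour search.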

Paper Structure

This paper contains 35 sections, 5 equations, 33 figures, 9 tables.

Figures (33)

  • Figure 1: Overview of UP2You. Our method reconstructs high-quality, textured 3D clothed portraits from unconstrained photo collections. It robustly handles highly diverse and unstructured inputs by rectifying them into orthogonal multi-view images and corresponding normal maps, making them compatible with traditional reconstruction algorithms.
  • Figure 2: Paradigm differences between previous works and UP2You. Top: Previous works like PuzzleAvatar [xiu2024puzzleavatar] and AvatarBooth [zeng2023avatarbooth] compress unconstrained photos into implicit personal tokens and DreamBooth weights [ruiz2023dreambooth] through fine-tuning, then generate 3D humans via SDS optimization [ruiz2023dreambooth]. Bottom: UP2You directly rectifies unconstrained photo collections into orthogonal view images and normals, then reconstructs textured human meshes, achieving superior quality while reducing processing time from 4 hours to 1.5 minutes.
  • Figure 3: Pipeline of UP2You. Given unconstrained input photos $\mathbf{I}$, we first predict the SMPL-X shape parameters (\ref{sec:method_shape}) and initialize the SMPL-X mesh with predefined pose and expression parameters. We then generate orthogonal view images $\mathbf{V}$ from $\mathbf{I}$ and the SMPL-X normal rendering $\mathbf{P}$ using the proposed PCFA method, which predicts correlation maps $\mathbf{C}$ and selects the most informative features (\ref{sec:method_rgb_gen}). Finally, we produce multi-view normal maps $\mathbf{N}$ from $\mathbf{P}$ and $\mathbf{V}$, and reconstruct the final textured mesh (\ref{sec:method_normal_mesh}).
  • Figure 4: Pose-Dependent Correlation Map. Correlation is colored as Higher$\rightarrow$Lower.
  • Figure 5: Normal Map Generation Pipeline. The main difference from \ref{fig:pipline_rgb} is the input: the generated multi-view orthogonal images $\mathbf{V}$ are used instead of the unconstrained inputs $\mathbf{I}$.
  • ...and 28 more figures
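
The Figure 3 caption describes PCFA as predicting correlation maps $\mathbf{C}$ and selecting the most informative features. One way to picture why such a selection keeps memory nearly constant is the sketch below, which keeps only the `k` highest-scoring reference tokens for one target pose. This is an illustrative assumption, not the authors' implementation: the top-k rule, the token layout, the function name `select_top_k_features`, and the parameter `k` are all hypothetical, and the correlation scores are taken as given rather than predicted by a network.

```python
import heapq
from typing import Iterator, List, Sequence, Tuple

Feature = Tuple[float, ...]


def _scored_tokens(
    ref_features: Sequence[Sequence[Feature]],
    correlation: Sequence[Sequence[float]],
) -> Iterator[Tuple[float, Feature]]:
    """Yield (score, feature) pairs lazily, one reference image at a time."""
    if len(ref_features) != len(correlation):
        raise ValueError("need one correlation map per reference image")
    for feats, scores in zip(ref_features, correlation):
        if len(feats) != len(scores):
            raise ValueError("each token needs exactly one correlation score")
        yield from zip(scores, feats)


def select_top_k_features(
    ref_features: Sequence[Sequence[Feature]],
    correlation: Sequence[Sequence[float]],
    k: int,
) -> List[Feature]:
    """Return the k tokens with the highest correlation, best first.

    ref_features[i][j] is the j-th feature token of reference image i and
    correlation[i][j] is its (hypothetical) score for one target pose.
    The result has at most k tokens no matter how many references exist.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    pairs = _scored_tokens(ref_features, correlation)
    top = heapq.nlargest(k, pairs, key=lambda pair: pair[0])
    return [feature for _, feature in top]


if __name__ == "__main__":
    refs = [[(1.0,), (2.0,)], [(3.0,), (4.0,)], [(5.0,), (6.0,)]]
    corr = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    print(select_top_k_features(refs, corr, k=3))
```

Because the selected set has a fixed size, any downstream module that consumes it sees the same number of tokens whether 3 or 30 reference photos are supplied, which is the intuition behind the constant-memory claim under this assumed selection rule.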