
Reality's Canvas, Language's Brush: Crafting 3D Avatars from Monocular Video

Yuchen Rao, Eduardo Perez Pellitero, Benjamin Busam, Yiren Zhou, Jifei Song

TL;DR

ReCaLaB is a fully-differentiable pipeline that learns high-fidelity 3D human avatars from a single RGB video. It outperforms previous monocular approaches in image quality on image synthesis tasks and offers an intuitive natural-language interface for creative manipulation of 3D human avatars.

Abstract

Recent advancements in 3D avatar generation excel with multi-view supervision for photorealistic models. However, monocular counterparts lag in quality despite broader applicability. We propose ReCaLaB to close this gap. ReCaLaB is a fully-differentiable pipeline that learns high-fidelity 3D human avatars from just a single RGB video. A pose-conditioned deformable NeRF is optimized to volumetrically represent a human subject in canonical T-pose. The canonical representation is then leveraged to efficiently associate neural textures using 2D-3D correspondences. This enables the separation of diffused color generation and lighting correction branches that jointly compose an RGB prediction. The design allows intermediate results for human pose, body shape, texture, and lighting to be controlled with text prompts. An image-conditioned diffusion model helps to animate the appearance and pose of the 3D avatar, creating video sequences with previously unseen human motion. Extensive experiments show that ReCaLaB outperforms previous monocular approaches in terms of image quality for image synthesis tasks. Moreover, natural language offers an intuitive user interface for creative manipulation of 3D human avatars.
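
As background for the volumetric representation mentioned in the abstract, the standard NeRF quadrature for compositing samples along a camera ray can be sketched as follows. This is a minimal illustration, assuming densities and colors have already been queried from the canonical field at warped sample points; the function name `composite_ray` and its arguments are hypothetical and are not taken from the paper, whose exact rendering formulation is not given here.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (standard NeRF quadrature).

    densities: per-sample volume density (sigma), ordered front to back
    colors:    per-sample RGB triples
    deltas:    distance between consecutive samples along the ray
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    # Return the composited color and the accumulated opacity of the ray.
    return rgb, 1.0 - transmittance
```

A ray that passes only through empty space (all densities zero) returns black with zero opacity, while a very dense first sample fully determines the pixel color.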

Paper Structure (31 sections, 11 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: Overview of ReCaLaB. We propose ReCaLaB, a novel approach for photo-realistic 3D avatar creation from monocular video that supports several types of manipulation. Our approach models neural humans with a disentangled neural texture field, enabling novel view rendering, novel pose animation, and novel appearance with customized shape, texture, and illumination via the language brush.
  • Figure 2: Framework of ReCaLaB. Our approach takes monocular video frames as input to reconstruct the human avatar. In module (a), we first learn the backward deformation $T$ of the human body using the human pose. In module (b), we generate a corresponding UVS map with a volume generator $G_c^F$ initialized by the SMPL T-pose prior, followed by mapping functions $G_c^M$ ($M \in \{U, V, S\}$). In module (c), a texture generator $H_c^T$, together with the diffused color branch $H_c^A$ and the lighting correction branch $H_c^N$, performs neural texture learning. Finally, in module (d), the shape, texture, and lighting correction modules can be updated according to instructions from the language brush.
  • Figure 3: Qualitative results for novel-view human rendering on the ZJU-MoCap dataset [peng2021neural]. HumanNeRF [weng2022humannerf] suffers from ghosting artifacts (see marked regions), while deformation artifacts arise in UV Volumes [chen2023uv] due to the texture-driven training. Our approach addresses both issues, producing results visually closer to the ground truth.
  • Figure 4: Qualitative results for novel-pose human rendering on the ZJU-MoCap dataset [peng2021neural]. Our method yields the best texture quality among the compared methods.
  • Figure 5: Qualitative results on novel poses guided by text instruction. ReCaLaB provides sharper texture and fewer geometric artifacts compared to HumanNeRF [weng2022humannerf].
  • ...and 7 more figures
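
The Figure 2 caption describes mapping 3D points to UV coordinates and then reading from a learned texture. A common way to make such a lookup differentiable with respect to the coordinates is bilinear interpolation, sketched below. This is only an illustration under stated assumptions: the function name `sample_texture`, the texture layout (rows of per-texel feature lists), and the $[0, 1]$ coordinate convention are hypothetical, and the sketch covers neither the S channel nor how the diffused color and lighting correction branches are combined.

```python
def sample_texture(texture, u, v):
    """Bilinearly interpolate a texture at (u, v), both expected in [0, 1].

    texture: list of rows; each row is a list of per-texel feature lists
    """
    height, width = len(texture), len(texture[0])
    # Clamp to the valid range and convert to continuous texel coordinates.
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        # Element-wise linear interpolation between two feature lists.
        return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]

    top = lerp(texture[y0][x0], texture[y0][x1], fx)
    bottom = lerp(texture[y1][x0], texture[y1][x1], fx)
    return lerp(top, bottom, fy)
```

Because the output is a weighted sum of the four neighboring texels, gradients can flow both to the texture values and to the predicted UV coordinates, which is what allows a texture and a UV mapping to be optimized jointly in pipelines of this kind.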