Animatable Neural Radiance Fields from Monocular RGB Videos

Jianchuan Chen, Ying Zhang, Di Kang, Xuefei Zhe, Linchao Bao, Xu Jia, Huchuan Lu

TL;DR

This work tackles reconstructing and animating realistic 3D human avatars from monocular RGB videos. It introduces Animatable Neural Radiance Fields (animatable NeRF) that explicitly deform observations into a canonical space using SMPL, enabling high-detail, view-consistent rendering and novel-pose animation. A joint optimization with pose refinement (analysis-by-synthesis) robustly corrects SMPL estimates during training, improving geometry and appearance while accelerating convergence. Across real and synthetic datasets, the method achieves superior novel-view synthesis, accurate 3D reconstruction, and controllable novel-pose rendering, highlighting its potential for accessible, avatar-based applications from simple video input.

Abstract

We present animatable neural radiance fields (animatable NeRF) for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to dynamic scenes with human movement by introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical space for the detailed human template, which enables natural shape deformation from the observation space to the canonical space under the explicit control of the pose parameters. To compensate for inaccurate pose estimation, we introduce a pose refinement strategy that updates the initial pose during learning, which not only yields a more accurate human reconstruction but also accelerates convergence. In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from novel views, and 3) animation of the human with novel poses.
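
To make the observation-to-canonical mapping concrete, here is a minimal NumPy sketch of one way a pose-guided deformation can be written. It is an illustration under assumptions, not the authors' implementation: each sample point borrows the skinning weights of its nearest posed body-model vertex, and the blended bone transform is inverted to "unpose" the point. The function name `to_canonical` and all argument names are hypothetical.

```python
import numpy as np

def to_canonical(points, posed_verts, vert_weights, bone_transforms):
    """Map observation-space points to canonical space (illustrative sketch).

    points:          (N, 3) sample points in observation space
    posed_verts:     (V, 3) body-model vertices in the observed pose
    vert_weights:    (V, K) skinning weights per vertex (rows sum to 1)
    bone_transforms: (K, 4, 4) canonical-to-observation transforms per bone
    """
    # Assumption: each point takes the weights of its nearest posed vertex.
    d2 = ((points[:, None, :] - posed_verts[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)                       # (N,)
    w = vert_weights[nearest]                         # (N, K)

    # Blend the per-bone transforms, then invert to go back to canonical space.
    blended = np.einsum("nk,kij->nij", w, bone_transforms)   # (N, 4, 4)
    inverse = np.linalg.inv(blended)                          # (N, 4, 4)

    # Apply the inverse transform in homogeneous coordinates.
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    canonical = np.einsum("nij,nj->ni", inverse, homo)
    return canonical[:, :3]
```

Because the canonical space is shared across frames, points from every frame can be fed to a single radiance field once they have been mapped this way; errors in the estimated pose show up directly as misplaced canonical points, which is why refining the pose during training matters.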

Paper Structure

This paper contains 22 sections, 10 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Overview of the proposed Animatable Neural Radiance Fields. Given a video sequence, we estimate the camera $K_t$ and SMPL parameters $M(\theta_{t}, \beta_{t})$ of the human subject for initialization. We sample points $(x_t, y_t, z_t)$ along each camera ray in observation space and transform them to canonical space via pose-guided deformation. The transformed points $(x_t^0, y_t^0, z_t^0)$ are fed into the neural radiance field to obtain densities $\sigma$ and colors $\mathbf{c}$. We then render the image with the volume rendering integral and jointly optimize the neural radiance field parameters $\phi$ and the SMPL parameters $\theta_t,\beta_t$ by minimizing the masked error $\mathcal{L}\left(\tilde{I}_t, I_t\right)$ between the rendered image $\tilde{I}_t$ and the ground-truth image $I_t$.
  • Figure 2: Visual comparison of different methods for novel view synthesis on People-Snapshot (rows 1-2) and iPER (rows 3-4). NeRF struggles to handle dynamic scenes because the movement of the subject violates the multi-view consistency requirement. With the help of our proposed pose-guided deformation, NeRF+U (NeRF + Unpose) achieves much better results (rows 1-2) if the estimated SMPL poses are accurate, but still produces blurry results (rows 3-4) if they are not. Further adding pose refinement (ours) greatly improves robustness as long as the estimated SMPL pose is reasonably good. Compared with NeuralBody and SMPLpix, our approach can produce realistic images with well-preserved identity and clothing details.
  • Figure 3: Results of novel view synthesis on iPER (a-d) and People-Snapshot (e-h). Our method synthesizes realistic and multi-view consistent results from different camera views while keeping the subject's pose fixed.
  • Figure 4: Visualization of 3D reconstruction on Multi-Garment. NeRF and NeRF+U (NeRF + Unpose) fail to reconstruct 3D geometry due to the movement of the subject and the inaccurate SMPL. Compared with NeRF+L (NeRF + Latent), which produces over-smooth or under-smooth results, our results are more reasonable. As a reference, NeRF+U(GT) uses GT SMPL and learns geometry with very high precision, demonstrating the effectiveness of our pose-guided deformation and showing the importance of obtaining accurate SMPL for 3D reconstruction tasks.
  • Figure 5: Comparison of 3D reconstruction results on People-Snapshot with Video Avatars. Compared with Video Avatars, our approach recovers more details such as hair and clothing wrinkles.
  • ...and 9 more figures
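
The rendering step named in the Figure 1 caption can be sketched with the standard discretized NeRF volume rendering rule, which turns per-sample densities $\sigma$ and colors $\mathbf{c}$ along one ray into a pixel color. This is a minimal sketch of that general quadrature, not code from the paper; the function name `render_ray` and its arguments are hypothetical.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray with the NeRF quadrature rule.

    sigmas: (S,)   non-negative densities at the samples
    colors: (S, 3) RGB colors at the samples
    deltas: (S,)   distances between consecutive samples
    """
    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unobstructed.
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    # Per-sample contribution to the final pixel.
    weights = transmittance * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Every operation here is differentiable, so an image-space error between rendered and observed pixels can be back-propagated to whatever produced the densities and colors, including the deformation that placed the samples in canonical space.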