HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses

Caoyuan Ma, Yu-Lun Liu, Zhixiang Wang, Wu Liu, Xinchen Liu, Zheng Wang

TL;DR

HumanNeRF-SE presents a streamlined architecture that combines explicit SMPL priors with an implicit NeRF to animate humans in diverse poses from monocular, few-shot input. It voxelizes the SMPL body, applies a Conv-Filter to discard sampling points that are irrelevant to the body, and refines each point's canonical coordinates with spatial-aware features. This yields strong pose generalization with fewer than 1% of the learnable parameters and 95% less training time in the few-shot setting. Renderings show fewer artifacts than prior methods, and image synthesis is 15 times faster without any existing acceleration modules. Because it relies only on readily available SMPL information and a simple design, the method is practical for industrial video production and real-time-style applications, especially when data is limited.

Abstract

We present HumanNeRF-SE, a simple yet effective method that synthesizes images in diverse novel poses from simple input. Previous HumanNeRF works require a large number of optimizable parameters to fit the human images. Instead, we rework these approaches by combining explicit and implicit human representations to design both a generalized rigid deformation and a specific non-rigid deformation. Our key insight is that an explicit shape can reduce the number of sampling points used to fit the implicit representation, and that building the generalized rigid deformation from frozen SMPL blending weights effectively avoids overfitting and improves pose generalization. Our architecture, which involves both explicit and implicit representations, is simple yet effective. Experiments demonstrate that our model can synthesize images under arbitrary poses from few-shot input and that it synthesizes images 15 times faster by reducing computational complexity, without using any existing acceleration modules. Compared to state-of-the-art HumanNeRF studies, HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time.
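As a rough illustration of the rigid deformation mentioned in the abstract, the sketch below blends per-bone rigid transforms with fixed (non-learned) weights to map an observation-space point into canonical space. It is not the authors' implementation: the function names (`rot_z`, `apply_rigid`, `rigid_deform`), the two-bone toy skeleton, and the choice to blend already-inverted bone transforms are all assumptions made for the example.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix about the z-axis (toy stand-in for a bone rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(R, t, x):
    """Return R @ x + t for a 3x3 matrix R and 3-vectors t, x."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def rigid_deform(x_obs, weights, bone_transforms):
    """Map an observation-space point to canonical space.

    weights: frozen per-bone blending weights for this point (hypothetically
        looked up from the nearest SMPL vertex); they are never optimized.
    bone_transforms: list of (R, t) pairs, each mapping observation space to
        canonical space for one bone (an assumption of this sketch).
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("point has no blending weight; it should be filtered")
    x_can = [0.0, 0.0, 0.0]
    for w, (R, t) in zip(weights, bone_transforms):
        y = apply_rigid(R, t, x_obs)
        for i in range(3):
            x_can[i] += (w / total) * y[i]
    return x_can


# Toy skeleton: bone 0 is unchanged, bone 1 is bent 90 degrees about z,
# so its observation-to-canonical transform rotates by -90 degrees.
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
unbend = (rot_z(-math.pi / 2), [0.0, 0.0, 0.0])
bones = [identity, unbend]

on_bone_1 = rigid_deform([0.0, 1.0, 0.0], [0.0, 1.0], bones)
near_joint = rigid_deform([0.0, 1.0, 0.0], [0.5, 0.5], bones)
```

Because the weights are taken as given rather than optimized per subject, the only inputs that change between poses are the bone transforms, which is the property the abstract credits for avoiding overfitting.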

Paper Structure (38 sections, 14 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: Overview. HumanNeRF-SE efficiently synthesizes images of performers in diverse poses, blending simplicity with effectiveness. It outperforms previous methods by creating a wider range of new poses (a), maintains generalization without overfitting with limited input frames (b), and requires fewer than 1% of learnable parameters, reducing training time by 95% while delivering superior results in the few-shot scenario (c). $^\dagger$LPIPS = 1,000$\times$LPIPS. Project page: https://miles629.github.io/humanNeRF-se.github.io/
  • Figure 2: Different weights for deformation. (a) Prior methods [peng2021animatable, weng2022humanNeRF, yu2023monohuman, liu2021neural] learn a weight volume for deformation through neural networks or fine-tune blending weights obtained from fitting SMPL to the input frame. The weight volume optimized along with NeRF parameters per human image is prone to over-fitting. When synthesizing novel pose images, the over-fitted weights will deform points onto the canonical space incorrectly and lead to artifacts. (b) Our idea is to use SMPL's blending weights directly because these weights are pre-trained on numerous human images to avoid overfitting. However, simply utilizing the nearest SMPL vertex's blending weights for deformation fills the sampling space with incorrect colors as the training phase deforms irrelevant sampling points onto the human body. (c) We propose to filter irrelevant points according to the human body information of SMPL. This way, we can avoid over-fitting and reduce the number of sampling points.
  • Figure 3: Framework of HumanNeRF-SE. (a) We first voxelize the observation space as a voxel volume $\mathbf{V}$. For a voxel containing vertices, the value is the number of vertices (as one occupancy channel) and the corresponding SMPL weight. (b) We perform channel-by-channel convolution on the volume. All sampling points are queried in the convolved volume to get their spatial-aware features. Points with zero occupancy are filtered out. (c) For the remaining points, we query the nearest weight in the volume, which is used for rigid deformation. Spatial-aware features are used in the neural network to correct the rigid results and obtain the final point coordinates in the canonical space. The sampling points in the canonical space obtain their colors and densities through the NeRF network. The densities of filtered points are forced to be zero.
  • Figure 4: Qualitative results with few-shot training images. Because of the limited information used in training, previous methods [weng2022humanNeRF, yu2023monohuman, peng2021animatable] cannot learn appropriate human weights. The official code of Ani-NeRF [peng2021animatable] did not produce reasonable results on our data since it is designed for multi-camera input. HumanNeRF [weng2022humanNeRF] exhibits distortion and artifacts. The performance of MonoHuman [yu2023monohuman] is heavily influenced by the specific data.
  • Figure 5: Rendering results with pose sequences from Subject-387 in ZJU-MoCap. We use all the videos of the performers to train and synthesize images with different pose sequences from Subject-387. The baselines produce noticeable artifacts, while our method maintains high-quality image synthesis.
  • ...and 12 more figures
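
The voxelize-then-filter steps described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: it keeps only the occupancy channel (the SMPL weight channels are omitted), stores the volume sparsely in a dictionary, and uses an all-ones box kernel in place of whatever kernel the authors use. The names `voxel_index`, `voxelize`, `box_filter`, and `filter_points` are hypothetical.

```python
import math
from collections import defaultdict


def voxel_index(p, origin, voxel_size):
    """Integer voxel coordinates of a 3D point."""
    return tuple(math.floor((p[i] - origin[i]) / voxel_size) for i in range(3))


def voxelize(vertices, origin, voxel_size):
    """Occupancy channel: number of body vertices falling in each voxel."""
    counts = defaultdict(int)
    for v in vertices:
        counts[voxel_index(v, origin, voxel_size)] += 1
    return dict(counts)


def box_filter(counts, radius=1):
    """Convolve the sparse occupancy volume with an all-ones cubic kernel.

    Each output voxel holds the total vertex count within `radius` voxels,
    so voxels just outside the body surface also become non-zero.
    """
    out = defaultdict(int)
    offsets = range(-radius, radius + 1)
    for (i, j, k), c in counts.items():
        for di in offsets:
            for dj in offsets:
                for dk in offsets:
                    out[(i + di, j + dj, k + dk)] += c
    return dict(out)


def filter_points(points, smoothed, origin, voxel_size):
    """Split sampling points into kept (non-zero occupancy) and dropped."""
    kept, dropped = [], []
    for p in points:
        if smoothed.get(voxel_index(p, origin, voxel_size), 0) > 0:
            kept.append(p)
        else:
            dropped.append(p)
    return kept, dropped


# Toy data: three "body" vertices and two ray sampling points.
origin, voxel_size = (0.0, 0.0, 0.0), 0.1
vertices = [(0.05, 0.05, 0.05), (0.06, 0.04, 0.05), (0.25, 0.05, 0.05)]
samples = [(0.15, 0.05, 0.05), (0.95, 0.95, 0.95)]

occupancy = voxelize(vertices, origin, voxel_size)
smoothed = box_filter(occupancy, radius=1)
kept, dropped = filter_points(samples, smoothed, origin, voxel_size)
```

In this toy setup the first sample sits between occupied voxels and is kept, while the second is far from every vertex and is dropped; in the caption's terms, dropped points would simply be assigned zero density instead of being passed through the deformation and NeRF networks.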