Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

Jaehyeok Kim, Dongyoon Wee, Dan Xu

TL;DR

MoCo-NeRF addresses free-viewpoint rendering of dynamic humans from monocular video by modeling non-rigid, pose-dependent motions as radiance residuals atop a rigid canonical radiance field. It decomposes radiance into a rigid branch and a non-rigid residual branch, both learned in radiance space, and conditions the residuals on a cross-attention-based pose-embedded implicit feature. The approach enables unified multi-subject training via global and local multiresolution hash encoders and an ID-code dictionary, achieving state-of-the-art results on ZJU-MoCap and MonoCap with markedly improved training efficiency. This radiance-compositional framework advances practical monocular dynamic human rendering by delivering high-fidelity, pose-aware non-rigid motion modeling across single and multiple subjects without heavy supervision or SMPL dependencies.

Abstract

This paper introduces Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework designed to perform free-viewpoint rendering of monocular human videos via a novel non-rigid motion modeling approach. In the context of dynamic clothed humans, complex cloth dynamics generate non-rigid motions that are intrinsically distinct from skeletal articulations and critically important for the rendering quality. The conventional approach models non-rigid motions as spatial (3D) deviations in addition to skeletal transformations. However, this approach is either time-consuming or struggles to achieve optimal quality, because its learning complexity is high and it lacks direct supervision. To address this problem, we propose modeling non-rigid motions as radiance residual fields, which benefits from more direct color supervision in the rendering and uses the rigid radiance fields as a prior to reduce the complexity of the learning process. Our approach utilizes a single multiresolution hash encoding (MHE) to concurrently learn the canonical T-pose representation from rigid skeletal motions and the radiance residual field for non-rigid motions. Additionally, to further improve both training efficiency and usability, we extend MoCo-NeRF to support simultaneous training of multiple subjects within a single framework, thanks to our effective design for modeling non-rigid motions. This scalability is achieved through the integration of a global MHE and learnable identity codes in addition to multiple local MHEs. We present extensive results on ZJU-MoCap and MonoCap, clearly demonstrating state-of-the-art performance in both single- and multi-subject settings. The code and model will be made publicly available at the project page: https://stevejaehyeok.github.io/publications/moco-nerf.
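
To make the idea of a radiance residual concrete, the following is a minimal sketch and not the authors' implementation. It assumes the non-rigid branch predicts additive offsets to the color and density of the rigid branch at each ray sample, and that the composed samples are then rendered with the standard NeRF quadrature. The function names `compose_radiance` and `volume_render`, the additive form, and the clamping are all illustrative assumptions; the paper's exact composition may differ.

```python
import math

def compose_radiance(rigid, residual):
    """Add a non-rigid residual to a rigid sample (assumed additive form).

    Both arguments are (r, g, b, sigma) tuples. Colors are clamped to
    [0, 1] and density to be non-negative; the clamping is an assumption.
    """
    r, g, b, sigma = (x + dx for x, dx in zip(rigid, residual))
    clamp = lambda v: min(max(v, 0.0), 1.0)
    return (clamp(r), clamp(g), clamp(b), max(sigma, 0.0))

def volume_render(samples, deltas):
    """Standard NeRF quadrature over (r, g, b, sigma) samples on one ray."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for (r, g, b, sigma), delta in zip(samples, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution to the pixel
        pixel = [p + weight * c for p, c in zip(pixel, (r, g, b))]
        transmittance *= 1.0 - alpha             # light left after the segment
    return pixel

# Usage: four samples on one ray, rigid radiance plus a small residual.
rigid_samples = [(0.5, 0.5, 0.5, 1.0)] * 4
residuals = [(0.1, -0.1, 0.0, 0.2)] * 4
composed = [compose_radiance(s, d) for s, d in zip(rigid_samples, residuals)]
print(volume_render(composed, [0.1] * 4))
```

Because the rendered pixel is compared directly with the observed color, the gradient of a photometric loss reaches the residual through the composition step, which is the "more direct color supervision" the abstract refers to.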

Paper Structure

This paper contains 21 sections, 10 equations, 12 figures, and 9 tables.

Figures (12)

  • Figure 1: a) We introduce a motion-oriented compositional NeRF for photo-realistic modeling of dynamic humans from monocular videos. Our approach employs radiance compositions to capture pose-adaptive non-rigid motions, overcoming the limitations of skeletal transformations that typically yield an average of observed deformations (mean motions). b) The proposed MoCo-NeRF achieves state-of-the-art rendering quality and noteworthy efficiency in novel view synthesis, compared to leading competitors hu2024gauhuman, geng2023instantnvr, weng2022humannerf. Bold text denotes the training duration proposed for each comparison model. MoCo-NeRF uniquely captures coherent non-rigid motions, like T-shirt wrinkles relative to body pose, from entirely new viewpoints. Moreover, MoCo-NeRF is significantly more efficient than another photo-realistic method, HumanNeRF weng2022humannerf.
  • Figure 2: Overview of the proposed framework MoCo-NeRF for free-view human rendering from a monocular video. Without estimating a geometrical offset of each continuous canonical point for each body pose, our framework is able to handle all deformations via its radiance-compositional approach with a single MHE mueller2022instant and achieve state-of-the-art performance. The pose-embedded implicit feature further enhances the learning of non-rigid radiance residuals by enabling pose-distinctive representation learning.
  • Figure 3: Illustration of the proposed pose-embedded implicit feature generation. We employ cross-attention to modulate the single learnable base code into pose-adaptive features.
  • Figure 4: Illustration of the extended architecture of MoCo-NeRF for multi-subject unified training (Sec. \ref{method:multiSubjects}). The major components are the global MHE, the set of local MHEs, and the dictionary of learnable base codes used as ID codes.
  • Figure 5: Qualitative comparison of novel view synthesis from single-subject training. MoCo-NeRF achieves photo-realistic rendering quality with high-fidelity non-rigid motion modeling. Although HumanNeRF weng2022humannerf presents comparable quality, MoCo-NeRF requires much less training time, as shown in Tab. \ref{tab:efficiency}.
  • ...and 7 more figures
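
The cross-attention modulation described in the Figure 3 caption can be illustrated with a small sketch. It is a hedged illustration under stated assumptions, not the paper's architecture. It assumes the learnable base code is split into tokens that act as queries, that per-joint pose vectors act as keys and values, and that the attended result is added back to the base tokens. Learned projection matrices are omitted, and the names `cross_attention` and `pose_embedded_feature` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def pose_embedded_feature(base_tokens, pose_tokens):
    """Modulate base-code tokens with pose context (assumed residual form)."""
    attended = cross_attention(base_tokens, pose_tokens, pose_tokens)
    return [[b + a for b, a in zip(bt, at)]
            for bt, at in zip(base_tokens, attended)]

# Usage: one shared base code yields different features for different poses.
base = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical base code, two tokens
pose_a = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical per-joint pose vectors
pose_b = [[0.0, 2.0], [2.0, 0.0]]
print(pose_embedded_feature(base, pose_a))
print(pose_embedded_feature(base, pose_b))
```

In this layout the base code is shared across all frames while the pose tokens change per frame, so the output feature varies with body pose even though only a single code is learned.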