HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion

Mustafa Işık, Martin Rünz, Markos Georgopoulos, Taras Khakhulin, Jonathan Starck, Lourdes Agapito, Matthias Nießner

TL;DR

HumanRF addresses the challenge of high-fidelity free-viewpoint synthesis for moving humans by representing appearance and motion as a temporally segmented 4D radiance field learned from 160-camera 12MP multi-view data. It introduces a 4D feature-grid decomposition built from 3D hash grids and 1D dense grids, coupled with adaptive temporal partitioning to enable long sequences within practical memory limits. The paper also introduces ActorsHQ, a high-resolution multi-view dataset with per-frame meshes, and demonstrates substantial improvements over state-of-the-art baselines in both qualitative and quantitative metrics, including full 12MP rendering. Together, these contributions advance production-level quality for neural human rendering and provide resources to the community for further research.

Abstract

Representing human performance at high fidelity is an essential building block in diverse applications, such as film production, computer games or videoconferencing. To close the gap to production-level quality, we introduce HumanRF, a 4D dynamic neural scene representation that captures full-body appearance in motion from multi-view video input, and enables playback from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing high-resolution details even in the context of challenging motion. While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of operating at 12MP. To this end, we introduce ActorsHQ, a novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity, per-frame mesh reconstructions. We demonstrate challenges that emerge from using such high-resolution data and show that our newly introduced HumanRF effectively leverages this data, making a significant step towards production-level quality in novel view synthesis.
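
To make the space-time factorization more concrete, here is a minimal Python sketch of a 4D feature lookup assembled from 3D grids and 1D grids. It is an illustration under assumptions, not the authors' implementation: the four-term sum of 3D-grid and 1D-grid products is assumed as one plausible form of the decomposition, dense NumPy arrays stand in for the hash grids, lookups use integer indices with no interpolation, and all sizes and names (`grids_3d`, `grids_1d`, `query_feature`, `parameter_counts`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
X, Y, Z, T, F = 8, 8, 8, 6, 4  # grid resolution per axis, feature channels

# Four 3D grids (dense arrays standing in for hash grids; an assumption).
grids_3d = {
    "xyz": rng.normal(size=(X, Y, Z, F)),
    "xyt": rng.normal(size=(X, Y, T, F)),
    "xzt": rng.normal(size=(X, Z, T, F)),
    "yzt": rng.normal(size=(Y, Z, T, F)),
}

# Four 1D dense grids, one per axis left out of the matching 3D grid.
grids_1d = {
    "t": rng.normal(size=(T, F)),
    "z": rng.normal(size=(Z, F)),
    "y": rng.normal(size=(Y, F)),
    "x": rng.normal(size=(X, F)),
}


def query_feature(ix, iy, iz, it):
    """Return the F-dimensional feature at integer grid indices.

    Each term multiplies a 3D-grid feature element-wise with the 1D-grid
    feature of the remaining axis; the four terms are summed.
    """
    return (
        grids_3d["xyz"][ix, iy, iz] * grids_1d["t"][it]
        + grids_3d["xyt"][ix, iy, it] * grids_1d["z"][iz]
        + grids_3d["xzt"][ix, iz, it] * grids_1d["y"][iy]
        + grids_3d["yzt"][iy, iz, it] * grids_1d["x"][ix]
    )


def parameter_counts():
    """Compare stored values: decomposed grids vs. a full dense 4D grid."""
    decomposed = sum(g.size for g in grids_3d.values())
    decomposed += sum(g.size for g in grids_1d.values())
    full = X * Y * Z * T * F
    return decomposed, full
```

Calling `query_feature(1, 2, 3, 4)` returns a vector of `F` values, and `parameter_counts()` shows that the decomposed grids store fewer values than a full 4D grid even at these toy sizes. The gap widens as resolution and sequence length grow, which is the motivation for factorizing in the first place.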

Paper Structure (24 sections, 11 equations, 10 figures, 4 tables)

This paper contains 24 sections, 11 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Overview of HumanRF: Prior to training, our method starts by splitting the temporal domain into 4D segments with similar union occupancy in 3D (§ Adaptive Temporal Partitioning). Each segment is modeled by a 4D feature grid which is compactly represented by utilizing tensor decomposition and hash grids (§ Feature Grid Decomposition). During training, we sample a batch of rays across different time frames and cameras. After each pixel color is predicted via volume rendering (§ Shared MLPs and Rendering), we enforce photometric constraints and regularize ray marching weights via foreground masks (§ Losses).
  • Figure 2: Fixed segment size vs. adaptive partitioning. Using a single 4D representation for an entire sequence (segment size 400) or a 3D hash grid per frame (segment size 1) gives poor results. We observe that finding the middle ground (segment sizes from 3 to 100) leads to better results (a, b, c). Sequences with moderate motions favor larger segment sizes whereas those with stronger motions favor smaller ones (b). Our Adaptive Temporal Partitioning scheme (§ Adaptive Temporal Partitioning) avoids the costly hyper-parameter search for the optimal, global segment size, and leads to results close to those of optimal segment sizes (a, b). On average, our adaptive method is better than using any fixed segment size (c). These experiments are performed on 400-frame sequences using shared MLPs. The total number of parameters is kept approximately the same while varying the segment size.
  • Figure 3:
  • Figure 4: Actors and meshes. Our ActorsHQ dataset contains 8 actors with casual clothing such as skirts or shorts. Each sequence is captured by 160 cameras, each recording at 12MP. In addition to the recorded images, we also provide high-quality, per-frame mesh reconstructions with approximately 500k vertices.
  • Figure 5: Comparison with human-specific methods. Although [peng2021neural] and [li2022tava] incorporate geometry and pose information, they fail to capture fine details, and produce blurrier results compared to HumanRF.
  • ...and 5 more figures
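
The adaptive temporal partitioning mentioned in the captions of Figures 1 and 2 can be illustrated with a small greedy sketch in Python. This is a simplified stand-in rather than the paper's algorithm: it assumes per-frame boolean occupancy grids are available and closes a segment as soon as the union occupancy of its frames would exceed a fixed voxel budget. The function name `partition_by_union_occupancy` and the parameter `max_union_voxels` are hypothetical.

```python
import numpy as np


def partition_by_union_occupancy(occupancy, max_union_voxels):
    """Greedily split frames into contiguous segments.

    occupancy: bool array of shape (num_frames, X, Y, Z).
    A segment grows until adding the next frame would push the number of
    voxels in the segment's union occupancy above max_union_voxels
    (a hypothetical budget). Returns half-open (start, end) frame ranges.
    """
    num_frames = occupancy.shape[0]
    segments = []
    start = 0
    union = occupancy[0].copy()
    for t in range(1, num_frames):
        candidate = union | occupancy[t]
        if candidate.sum() > max_union_voxels:
            # Close the current segment and start a new one at frame t.
            segments.append((start, t))
            start = t
            union = occupancy[t].copy()
        else:
            union = candidate
    segments.append((start, num_frames))
    return segments


# Toy sequence: a 4-voxel "actor" that stands still for frames 0-3
# and then moves one voxel per frame along x.
frames = np.zeros((8, 16, 1, 1), dtype=bool)
for t in range(8):
    offset = max(0, t - 3)
    frames[t, offset:offset + 4] = True

segments = partition_by_union_occupancy(frames, max_union_voxels=5)
```

On this toy input the static part of the sequence ends up in one long segment while the moving part is cut into shorter ones, which mirrors the observation in Figure 2 that stronger motion favors smaller segment sizes.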