Generalizable Neural Human Renderer

Mana Masuda; Jinhyung Park; Shun Iwase; Rawal Khirodkar; Kris Kitani

TL;DR

The paper tackles animatable human rendering from monocular video without subject-specific test-time optimization. It introduces the Generalizable Neural Human Renderer (GNH), a three-stage pipeline: (1) it extracts appearance features from 2D views and lifts them into 3D using explicit SMPL priors; (2) it maps those features to the target pose; and (3) it fuses information from multiple source frames through a multi-frame fusion transformer and renders the target image with a CNN-based network. The method is trained with a composite objective $L = \lambda_1 L_{color} + \lambda_2 L_{LPIPS} + \lambda_3 L_{adv} + \lambda_4 L_{ab}$ and evaluated on the ZJU-MoCap, People Snapshot, and AIST++ datasets, where it achieves substantial LPIPS improvements (e.g., up to $31.5\%$ over GHuNeRF and up to $45.2\%$ on AIST++) and renders 2–7x faster than prior generalizable methods. Overall, GNH delivers high-fidelity, generalizable animatable human rendering from monocular video and can be deployed without per-subject optimization, though it relies on accurate pose and mask estimates and on static lighting for best results.
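The composite objective above is a weighted sum of four scalar terms. The following is a minimal sketch of that combination only, not the authors' training code: the function name `composite_loss`, the dictionary keys, and every numeric value are placeholders chosen for illustration, and this summary does not state the actual weights $\lambda_i$ or define what $L_{ab}$ measures.

```python
from typing import Mapping


def composite_loss(terms: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return sum_i lambda_i * L_i for already-computed scalar loss terms."""
    # Require the same names on both sides so no term is silently dropped.
    mismatched = set(terms) ^ set(weights)
    if mismatched:
        raise KeyError(f"terms and weights disagree on: {sorted(mismatched)}")
    return sum(weights[name] * terms[name] for name in terms)


# Placeholder values only (assumption): not taken from the paper.
example_terms = {"color": 0.5, "lpips": 0.25, "adv": 1.0, "ab": 2.0}
example_weights = {"color": 1.0, "lpips": 2.0, "adv": 0.5, "ab": 0.25}

total = composite_loss(example_terms, example_weights)
print(total)
```

In a real training loop each term would be a differentiable tensor rather than a Python float; the weighted-sum structure is the same.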

Abstract

While recent advancements in animatable human rendering have achieved remarkable results, they require test-time optimization for each subject, which can be a significant limitation for real-world applications. To address this, we tackle the challenging task of learning a Generalizable Neural Human Renderer (GNH), a novel method for rendering animatable humans from monocular video without any test-time optimization. Our core method focuses on transferring appearance information from the input video to the output image plane by utilizing explicit body priors and multi-view geometry. To render the subject in the intended pose, we utilize a straightforward CNN-based image renderer, foregoing the more common ray-sampling or rasterizing-based rendering modules. Our GNH achieves remarkably generalizable, photorealistic rendering of unseen subjects through a three-stage process. We quantitatively and qualitatively demonstrate that GNH significantly surpasses current state-of-the-art methods, notably achieving a 31.3% improvement in LPIPS.

Paper Structure (17 sections, 12 equations, 12 figures, 6 tables)

Introduction
Related Works
Novel View Synthesis for Humans
Generalizable Novel View Synthesis
Generalizable Novel View Synthesis for Humans
Method: Generalizable Neural Human Renderer
Source Feature Extraction
Source-to-Target Feature Mapping
Multi-Frame Aggregation and Rendering
Optimizing a Generalizable Neural Human Renderer
Experiments
Experimental Setup
Result
Ablations and Other Analysis
Discussion and Conclusion
...and 2 more sections

Figures (12)

Figure 1: Given only a monocular video as input, our novel generalizable human rendering framework outputs high-fidelity animatable human rendering without any test-time optimization.
Figure 2: Overview of our Generalizable Neural Human Renderer (GNH). Stage 1, 3D source feature extraction, extracts 2D features and lifts them to 3D using the body mesh vertices and camera parameters. Stage 2, source-to-target mapping, converts the 3D source features to the target domain and projects them into 2D. Stage 3, multi-frame aggregation and rendering, consolidates information from all the source frames using multi-view geometry and renders the target image using a CNN-based renderer.
Figure 3: Qualitative comparison of animatable rendering of unseen identity on the ZJU-MoCap dataset peng2021neural.
Figure 4: Qualitative comparison of animatable rendering of unseen identity on the People Snapshot dataset alldieck2018video.
Figure 5: Qualitative comparison of animatable rendering of unseen identity on the AIST++ dataset Li2021aist.
...and 7 more figures
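
Stage 1 in the Figure 2 caption lifts 2D features to 3D using body mesh vertices and camera parameters. One common way to do this, shown here as a sketch under stated assumptions rather than as the paper's implementation, is to project each vertex into the image with a pinhole camera and bilinearly sample the feature map at that pixel. The helper names `project_pinhole`, `bilinear_sample`, and `lift_features` are hypothetical; vertices are assumed to be already in camera coordinates, and the feature map has a single channel for brevity.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_pinhole(point_cam: Vec3, fx: float, fy: float,
                    cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-space 3D point to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return fx * x / z + cx, fy * y / z + cy


def bilinear_sample(feature_map: Sequence[Sequence[float]],
                    u: float, v: float) -> float:
    """Bilinearly interpolate a single-channel map at (u, v), clamped to the border."""
    height, width = len(feature_map), len(feature_map[0])
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    du, dv = u - x0, v - y0
    top = (1 - du) * feature_map[y0][x0] + du * feature_map[y0][x1]
    bottom = (1 - du) * feature_map[y1][x0] + du * feature_map[y1][x1]
    return (1 - dv) * top + dv * bottom


def lift_features(vertices_cam: Sequence[Vec3],
                  feature_map: Sequence[Sequence[float]],
                  fx: float, fy: float, cx: float, cy: float) -> List[float]:
    """Attach one sampled feature value to every mesh vertex."""
    return [bilinear_sample(feature_map, *project_pinhole(p, fx, fy, cx, cy))
            for p in vertices_cam]


# Toy inputs (assumption): a 2x2 feature map and three vertices at depth 1.
toy_map = [[0.0, 1.0], [2.0, 3.0]]
toy_vertices = [(-0.5, -0.5, 1.0), (0.0, 0.0, 1.0), (0.5, 0.5, 1.0)]
lifted = lift_features(toy_vertices, toy_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(lifted)
```

A full pipeline would also handle occluded vertices and multi-channel features, which this sketch leaves out.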
