Generalizable Neural Human Renderer
Mana Masuda, Jinhyung Park, Shun Iwase, Rawal Khirodkar, Kris Kitani
TL;DR
The paper tackles animatable human rendering from monocular video without subject-specific test-time optimization. It introduces the Generalizable Neural Human Renderer (GNH), a three-stage pipeline that extracts appearance features from 2D views, lifts them into 3D using explicit SMPL priors, maps the features to a target pose, and fuses information from multiple source frames with a multi-frame fusion transformer before rendering with a CNN-based network. The method is trained with a composite objective $L = \lambda_1 L_{color} + \lambda_2 L_{LPIPS} + \lambda_3 L_{adv} + \lambda_4 L_{ab}$ and evaluated on the ZJU-MoCap, People Snapshot, and AIST++ datasets, where it achieves substantial LPIPS improvements (e.g., up to $31.5\%$ over GHuNeRF and up to $45.2\%$ on AIST++) and 2–7x faster rendering than prior generalizable methods. Overall, GNH delivers high-fidelity, generalizable animatable human rendering from monocular video and enables rapid deployment without per-subject optimization, though it relies on accurate pose and mask estimates and static lighting for best results.
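The composite objective above is a weighted sum of four loss terms. The following is a minimal sketch of that combination only, not the authors' training code: the function name `composite_loss`, the dictionary keys, and all numeric values are hypothetical placeholders, and the summary does not state the actual weights $\lambda_i$ or how each term is computed.

```python
def composite_loss(terms, weights):
    """Return sum_i lambda_i * L_i over the named loss terms.

    terms:   dict mapping a term name to its already-computed scalar value
    weights: dict mapping the same names to their lambda coefficients
    """
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"no loss value supplied for: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Hypothetical per-batch loss values and weights, for illustration only.
terms = {"color": 0.10, "lpips": 0.20, "adv": 0.05, "ab": 0.01}
weights = {"color": 1.0, "lpips": 0.5, "adv": 0.1, "ab": 1.0}

total = composite_loss(terms, weights)
print(f"total loss: {total:.3f}")
```

Keeping the weights in a dictionary separate from the term values makes it easy to disable a term by setting its weight to zero, which is a common way to run ablations over objectives of this form.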
Abstract
While recent advancements in animatable human rendering have achieved remarkable results, they require test-time optimization for each subject, which can be a significant limitation for real-world applications. To address this, we tackle the challenging task of learning a Generalizable Neural Human Renderer (GNH), a novel method for rendering animatable humans from monocular video without any test-time optimization. Our core method focuses on transferring appearance information from the input video to the output image plane by utilizing explicit body priors and multi-view geometry. To render the subject in the intended pose, we utilize a straightforward CNN-based image renderer, foregoing the more common ray-sampling or rasterization-based rendering modules. Our GNH achieves remarkably generalizable, photorealistic rendering of unseen subjects through a three-stage process. We quantitatively and qualitatively demonstrate that GNH significantly surpasses current state-of-the-art methods, notably achieving a 31.3% improvement in LPIPS.
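LPIPS is a perceptual distance, so lower values are better. A percentage improvement in LPIPS is therefore usually read as a relative reduction with respect to a baseline score. The sketch below assumes that convention, which the abstract does not spell out. The helper name `relative_improvement` and the two scores are made up for illustration and are not results from the paper.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - ours) / baseline


# Made-up LPIPS scores, not values reported in the paper.
baseline_lpips = 0.200
ours_lpips = 0.150

gain = relative_improvement(baseline_lpips, ours_lpips)
print(f"relative LPIPS improvement: {gain:.1f}%")
```

Under this convention a positive value means the new method has lower (better) LPIPS than the baseline, and a negative value means it is worse.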
