LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh
Jing Wen, Alexander G. Schwing, Shenlong Wang
TL;DR
We address the challenge of generalizable human rendering from sparse inputs by learning priors in a data-driven pipeline that avoids per-scene optimization. The method, LIFe-GoM, combines a dual Gaussians-on-Mesh (GoM) representation with an iterative feedback reconstruction that refines a canonical GoM in T-pose and a coupled high-resolution Gaussian layer for high-fidelity rendering. Rendering runs at 95 FPS at $1024\times1024$ while reconstruction takes less than 1s, achieving state-of-the-art PSNR, LPIPS*, and FID on THuman2.0 and strong cross-domain results on XHuman and AIST++. The approach supports pose-conditioned rendering and novel pose synthesis without postprocessing, with robust performance across sparse multiview and multi-frame inputs, making it practical for real-time VR/AR applications.
Abstract
Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases learned from large-scale training data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges: First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering should be computationally efficient yet high-resolution. To address both challenges, we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman and AIST++ datasets. Our approach reconstructs an animatable representation from sparse inputs in less than 1s, renders views at 95.1 FPS at $1024 \times 1024$, and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state-of-the-art in rendering quality.
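The idea of an iterative feedback update can be illustrated with a toy sketch. The code below is not the paper's implementation: `render`, `feedback_update`, and `reconstruct` are hypothetical names, the "renderer" is a simple translation of 3D points, and the update rule is a fixed fraction of the residual standing in for what would be a learned network. It only shows the loop structure, where a canonical estimate is rendered, compared with the observation, and corrected over a small, fixed number of feed-forward steps.

```python
import numpy as np


def render(vertices, pose_offset):
    # Hypothetical stand-in for posing and rendering: translate the
    # canonical points by a known pose offset.
    return vertices + pose_offset


def feedback_update(vertices, observed, pose_offset, step=0.5):
    # Assumption: a learned module would map feedback to an update.
    # Here the "module" is simply a fixed fraction of the residual.
    residual = observed - render(vertices, pose_offset)
    return vertices + step * residual


def reconstruct(observed, pose_offset, init_vertices, num_iters=3):
    # Run a fixed number of feed-forward feedback steps and record the
    # mean absolute error between the re-rendering and the observation.
    vertices = init_vertices.copy()
    errors = []
    for _ in range(num_iters):
        vertices = feedback_update(vertices, observed, pose_offset)
        rerendered = render(vertices, pose_offset)
        errors.append(float(np.abs(observed - rerendered).mean()))
    return vertices, errors


rng = np.random.default_rng(0)
target = rng.normal(size=(8, 3))          # unknown canonical shape (toy)
pose_offset = np.array([0.1, 0.0, -0.2])  # known pose (toy)
observed = render(target, pose_offset)    # what the "camera" sees
init = np.zeros((8, 3))                   # template initialization (toy)

vertices, errors = reconstruct(observed, pose_offset, init)
print(errors)
```

In this toy setting each step halves the residual, so the recorded errors decrease monotonically. In a learned system the update would come from a trained network, and such a guarantee would not hold by construction.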
