LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh

Jing Wen, Alexander G. Schwing, Shenlong Wang

TL;DR

We address the challenge of generalizable human rendering from sparse inputs by learning priors in a data-driven pipeline that avoids per-scene optimization. The method, LIFe-GoM, combines a dual Gaussians-on-Mesh representation with an iterative feedback reconstruction that refines a canonical GoM in T-pose and a coupled high-resolution Gaussian layer for high-fidelity rendering. Rendering reaches high frame rates (95 FPS at $1024 \times 1024$) while reconstruction remains under 1 s, achieving state-of-the-art PSNR, LPIPS*, and FID on THuman2.0 and strong cross-domain results on XHuman and AIST++. The approach supports pose-conditioned rendering and novel pose synthesis without postprocessing, with robust performance across sparse multiview and multi-frame inputs, making it practical for real-time VR/AR applications.

Abstract

Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases extracted from training on large data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges: First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering is preferably computationally efficient yet of high resolution. To address both challenges, we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman and AIST++ data. Our approach reconstructs an animatable representation from sparse inputs in less than 1 s, renders views at 95.1 FPS at $1024 \times 1024$, and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state-of-the-art in rendering quality.
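
The coupled multi-resolution idea can be illustrated with a small sketch. It is not the authors' implementation: the midpoint subdivision, the one-Gaussian-per-face placement at the centroid, and the names `subdivide` and `gaussian_centers` are all assumptions made for illustration. The only point it tries to show is that a high-resolution mesh expressed as a fixed linear function of the low-resolution vertices lets Gaussians on the fine mesh follow any deformation applied to the coarse mesh.

```python
import numpy as np

def subdivide(num_vertices, faces):
    """Midpoint-subdivide a triangle mesh (illustrative coupling, an assumption).

    Returns (W, new_faces). W has shape (n_high, n_low) and maps low-res
    vertex positions to high-res ones: high = W @ low.
    """
    rows = list(np.eye(num_vertices))  # original vertices are kept as-is
    edge_to_vertex = {}

    def midpoint(i, j):
        # Create (or reuse) the new vertex in the middle of edge (i, j).
        key = (min(i, j), max(i, j))
        if key not in edge_to_vertex:
            row = np.zeros(num_vertices)
            row[i] = row[j] = 0.5
            edge_to_vertex[key] = len(rows)
            rows.append(row)
        return edge_to_vertex[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        # Each triangle is split into four smaller triangles.
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return np.stack(rows), np.array(new_faces)

def gaussian_centers(vertices, faces):
    """Hypothetical placement: one Gaussian per face, at the face centroid."""
    return vertices[faces].mean(axis=1)

# Low-res mesh: a tetrahedron standing in for the coarse body mesh.
low_vertices = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
low_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
W, high_faces = subdivide(len(low_vertices), low_faces)

# Gaussians live on the high-res mesh in the rest pose.
centers_rest = gaussian_centers(W @ low_vertices, high_faces)

# "Animate" only the low-res vertices; the high-res Gaussians follow through W.
offset = np.array([0.5, 0.0, -0.2])
centers_posed = gaussian_centers(W @ (low_vertices + offset), high_faces)
```

In this toy setup only the four coarse vertices are deformed, yet all sixteen Gaussian centers move consistently, which mirrors the stated motivation of animating a low-res mesh while rendering from a high-res one.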

Paper Structure

This paper contains 27 sections, 11 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview. We tackle the problem of generalizable human rendering. Given sparse source images (multiview images or multi-frame images), we reconstruct the 3D human representation in canonical T-pose space. The canonical representation can be animated and rendered in novel views.
  • Figure 2: Multi-resolution Gaussians-on-Mesh representation. We use a low-res mesh for faster animation and simpler geometry and attach Gaussians on a high-res mesh for better rendering.
  • Figure 3: Iterative feedback. We iteratively update in a feed-forward way the vertices of the low-resolution mesh and the Gaussian parameters attached to the high-resolution mesh. We repeat the update for $T$ steps. Each step $t$ operates on the source images, camera parameters and human poses, as well as the last iteration's results including the canonical representation $\text{GoM}_{t-1}^c$ and the predicted source images rendered by $\text{GoM}_{t-1}^c$ (the brown arrows).
  • Figure 4: Cross-domain generalization on the DNA-Rendering dataset without finetuning.
  • Figure 5: Novel pose synthesis. Poses are from the BEDLAM dataset.
  • ...and 8 more figures
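
The control flow described in the Figure 3 caption can be sketched as a short loop. This is a toy under stated assumptions, not the paper's model: the canonical representation is reduced to a flat parameter vector, `render` is a linear projection standing in for rendering into each source view, and `feedback_update` is a hand-written residual step used as a placeholder where the paper uses a learned feed-forward network. The number of steps (`num_steps=3`) is likewise a hypothetical value. What the sketch does preserve is the data flow: each step sees the source images, the cameras, the previous representation, and the source views predicted from that previous representation.

```python
import numpy as np

def render(state, cameras):
    """Toy stand-in for rendering the canonical representation into each source view."""
    return np.stack([cam @ state for cam in cameras])

def feedback_update(state, sources, rendered, cameras, step_size=0.3):
    """Placeholder for the learned update network (hypothetical).

    It compares source images with the previous prediction and returns an
    updated state. The real method predicts this update with a network.
    """
    residual = sources - rendered
    delta = sum(cam.T @ res for cam, res in zip(cameras, residual))
    return state + step_size * delta

def reconstruct(sources, cameras, init_state, num_steps):
    """Run num_steps feedback iterations and record the source-view error before each."""
    state = init_state
    errors = []
    for _ in range(num_steps):
        rendered = render(state, cameras)  # predictions from the previous state
        errors.append(float(np.sqrt(np.mean((sources - rendered) ** 2))))
        state = feedback_update(state, sources, rendered, cameras)
    return state, errors

rng = np.random.default_rng(0)
target = rng.normal(size=8)  # unknown "true" canonical parameters

# Three toy cameras, each a 5x8 projection with orthonormal rows.
cameras = [np.linalg.qr(rng.normal(size=(8, 8)))[0][:5] for _ in range(3)]
sources = render(target, cameras)  # sparse source observations

state, errors = reconstruct(sources, cameras, init_state=np.zeros(8), num_steps=3)
```

Because no gradients are taken through a loss at inference time, the loop stays feed-forward: replacing `feedback_update` with a trained network would leave the surrounding structure unchanged.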