GenFusion: Feed-forward Human Performance Capture via Progressive Canonical Space Updates

Youngjoong Kwon, Yao He, Heejung Choi, Chen Geng, Zhengmao Liu, Jiajun Wu, Ehsan Adeli

Abstract

We present a feed-forward human performance capture method that renders novel views of a performer from a monocular RGB stream. A key challenge in this setting is the lack of sufficient observations, especially for unseen regions. Assuming the subject moves continuously, more body parts become observable over time; we exploit this by maintaining a canonical space that is progressively updated with each incoming frame. This canonical space accumulates appearance information over time and serves as a context bank when direct observations are missing in the current live frame. To effectively use this context while respecting the deformation of the live state, we formulate the rendering process as probabilistic regression. This resolves conflicts between past and current observations, producing sharper reconstructions than deterministic regression approaches. Furthermore, it enables plausible synthesis even in regions with no prior observations. Experiments on in-domain (4D-Dress) and out-of-distribution (MVHumanNet) datasets demonstrate the effectiveness of our approach.
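
To make the progressive canonical-space update concrete, here is a minimal sketch in Python. It replaces the paper's learned fusion module with a simple visibility-weighted running average over per-vertex features; the names (`CanonicalFeatureBank`, `fuse`) and the binary visibility mask are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

class CanonicalFeatureBank:
    """Hypothetical per-vertex feature bank. The paper fuses features with a
    learned network; this sketch uses a visibility-weighted running average."""

    def __init__(self, num_vertices: int, feat_dim: int):
        self.features = np.zeros((num_vertices, feat_dim))  # accumulated appearance
        self.weights = np.zeros(num_vertices)               # evidence per vertex

    def fuse(self, live_feats: np.ndarray, visibility: np.ndarray) -> None:
        """Fold vertex-aligned features from the current frame into the canonical
        space; `visibility` is 1 where a vertex is observed in the live frame."""
        new_w = self.weights + visibility
        denom = np.maximum(new_w, 1e-8)[:, None]            # avoid divide-by-zero
        self.features = (self.weights[:, None] * self.features
                         + visibility[:, None] * live_feats) / denom
        self.weights = new_w

    def context(self) -> np.ndarray:
        """Context served to the renderer; never-observed vertices stay at zero
        (the gray regions of Figure 1)."""
        return self.features

# Usage: SMPL-X has 10,475 vertices; fuse one frame's vertex-aligned features.
bank = CanonicalFeatureBank(num_vertices=10475, feat_dim=64)
visibility = (np.random.rand(10475) > 0.5).astype(float)    # toy mask for illustration
bank.fuse(np.random.rand(10475, 64), visibility)
```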

Figures (7)

  • Figure 1: GenFusion is a feed-forward human performance capture method that progressively updates a canonical space to reconstruct humans in alignment with past observations from a monocular RGB stream. For example, given only a side-view input live frame (green box), GenFusion reconstructs the striped shirt pattern in the frontal view by retrieving information observed in past frames from the canonical space. Furthermore, by utilizing probabilistic rendering, we achieve high-fidelity view synthesis. (Gray regions in the canonical space indicate unobserved areas with no information.)
  • Figure 2: GenFusion renders novel views of live frames in a feed-forward manner from a monocular RGB stream. Given a live frame $I_t$, a feature map $F_t$ is extracted and aligned to the SMPL-X template mesh, yielding the vertex-aligned feature set $S_t$. $S_t$ is fused into the canonical feature set $S_\text{can}$, updating the temporal history. The canonical feature set $S_\text{can}$ is then warped and densified into the live space, forming the input $G_{\text{context},t}$ for the denoising network $\mathcal{U}_\text{denoiser}$. The denoising network denoises the noisy image $Z$ into the final novel-view live frame, conditioned on the live frame's state $G_{\text{live},t}$. The right side shows the progressive update of the canonical space: the first column shows the feature map; the second shows its RGB rendering. Our method synthesizes realistic details even without observations (top row) and refines the canonical space as more frames are incorporated. (A code sketch of this per-frame loop follows the figure list.)
  • Figure 3: Deterministic regression supervision (e.g., a pixel-wise loss) penalizes any deformation mismatch, leading to blurry outputs, whereas probabilistic regression supervision favors perceptually realistic synthesis over exact pixel-wise alignment.
  • Figure 4: In-domain generalization results on the 4D-Dress dataset. The monocular input stream is shown in the white boxes. Our method reconstructs novel views that align with past observations. Per-frame probabilistic methods (Champ, SIFU) fail to reconstruct details seen in earlier frames. Per-frame deterministic regression methods (SHERF, GHG) struggle to synthesize unobserved details. The temporal deterministic method NHP can leverage temporal context but produces blurry outputs. GauHuman requires per-subject optimization yet delivers lower visual quality and lacks generalizability.
  • Figure 5: Cross-dataset generalization results on the MVHumanNet dataset. The left column shows the monocular input stream. Champ hallucinates irrelevant details when regions are invisible in the current live frame; in contrast, our method generates details consistent with past observations.
  • ...and 2 more figures
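
Reading the Figure 2 caption as a per-frame loop, a compact sketch of the data flow might look as follows (PyTorch). Everything here is a hypothetical stand-in: `capture_step`, the `modules` bundle, and the stubs only mirror the shapes of the caption's quantities ($I_t$, $S_t$, $S_\text{can}$, $G_{\text{context},t}$, $G_{\text{live},t}$, $Z$); the paper's actual encoder, fusion, warping/densification, and denoiser are learned networks not shown here.

```python
import torch
from types import SimpleNamespace

def capture_step(I_t, smplx_verts, S_can, m, num_denoise_steps=4):
    """One feed-forward step of the Figure 2 pipeline (sketch only).

    I_t         : (3, H, W) live RGB frame
    smplx_verts : (V, 3) posed SMPL-X vertices for the live frame
    S_can       : (V, FD) canonical feature set carried across frames
    m           : bundle of stand-in networks (all hypothetical)
    """
    F_t = m.encoder(I_t)                              # image feature map F_t
    S_t = m.vertex_align(F_t, smplx_verts)            # vertex-aligned feature set S_t
    S_can = m.fuse(S_can, S_t)                        # progressive canonical update
    G_context = m.warp_densify(S_can, smplx_verts)    # canonical -> live-space context
    G_live = m.live_state(I_t, smplx_verts)           # live frame's state
    Z = torch.randn_like(I_t)                         # start from noise: probabilistic regression
    for _ in range(num_denoise_steps):
        Z = m.denoiser(Z, G_context, G_live)          # iterative denoising to the novel view
    return Z, S_can                                   # rendered frame + updated canonical set

# Minimal stubs so the sketch runs end-to-end (all shapes/behaviors assumed):
V, FD, H, W = 10475, 64, 128, 128                     # SMPL-X vertex count, feature width
modules = SimpleNamespace(
    encoder=lambda img: img.mean(0, keepdim=True).expand(FD, H, W),
    vertex_align=lambda fmap, verts: torch.zeros(V, FD),
    fuse=lambda can, new: 0.5 * (can + new),
    warp_densify=lambda can, verts: can.mean(0),
    live_state=lambda img, verts: img,
    denoiser=lambda z, ctx, live: 0.5 * z + 0.5 * live,
)
frame, S_can = torch.rand(3, H, W), torch.zeros(V, FD)
novel_view, S_can = capture_step(frame, torch.zeros(V, 3), S_can, modules)
```

Carrying `S_can` across calls is what lets later frames borrow appearance from earlier ones, while starting the render from noise and denoising makes the output a sample rather than a blurry pixel-wise mean, matching the contrast drawn in Figure 3.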