Representing Animatable Avatar via Factorized Neural Fields

Chunjin Song, Zhijie Wu, Bastian Wandt, Leonid Sigal, Helge Rhodin

TL;DR

This work tackles the challenge of reconstructing animatable 3D human avatars from monocular video while preserving large-scale shapes and fine, time-varying wrinkles. It introduces a frequency-aware, two-branch neural field that factorizes per-frame outputs into a pose-independent low-frequency component and a pose-dependent high-frequency residual, computed in canonical space and merged through SDF-based volume rendering. The approach distinguishes itself from prior NeRF-based methods by performing frequency-aware rendering in output space and coupling pose-independent features with pose-dependent details via intermediate representations, achieving improved frame consistency and detail fidelity. Empirical results on ZJU-Mocap and in-the-wild YouTube sequences demonstrate significant gains in novel view synthesis, novel pose rendering, and geometric reconstruction, highlighting the method’s practical potential for high-fidelity, animatable avatars. The work advances the state-of-the-art in neural avatars by enabling multi-scale geometry and texture stability across unseen poses and viewpoints, with broader implications for video-driven character personalization and animation pipelines.

Abstract

For reconstructing high-fidelity human 3D models from monocular videos, it is crucial to maintain consistent large-scale body shapes along with finely matched subtle wrinkles. This paper explores the observation that the per-frame rendering results can be factorized into a pose-independent component and a corresponding pose-dependent counterpart to facilitate frame consistency. Pose-adaptive textures can be further improved by restricting the frequency bands of these two components. In detail, pose-independent outputs are expected to be low-frequency, while high-frequency information is linked to pose-dependent factors. Using a dual-branch network with distinct frequency components, we coherently preserve both the coarse body contours across the entire input video and the fine-grained, time-varying texture features. The first branch takes coordinates in canonical space as input, while the second branch additionally takes the features output by the first branch and the pose information of each frame. Our network integrates the information predicted by both branches and uses volume rendering to generate photo-realistic 3D human images. Through experiments, we demonstrate that our network surpasses state-of-the-art methods based on neural radiance fields (NeRF) in preserving high-frequency details and ensuring consistent body contours.
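
The two-branch design described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the band counts 5 and 10 follow the [5, 10] setting named in the Figure 4 caption, while the additive merge of the two outputs, the layer widths, the feature size `FEAT`, the pose-vector size `POSE`, and the randomly initialized weights are all assumptions made for the example.

```python
import numpy as np

def positional_encoding(x, num_bands):
    """NeRF-style encoding [x, sin(2^k*pi*x), cos(2^k*pi*x)] for k < num_bands."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # (num_bands,)
    angles = x[..., None, :] * freqs[:, None]            # (..., num_bands, 3)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

def make_mlp(rng, sizes):
    """Randomly initialized weights; a stand-in for a trained network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def run_mlp(params, h):
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)                       # ReLU on hidden layers
    return h

L_IND, L_D = 5, 10        # frequency bands per branch (the [5, 10] setting)
FEAT, POSE = 16, 8        # hypothetical feature and pose-vector sizes

rng = np.random.default_rng(0)
branch1 = make_mlp(rng, [3 + 6 * L_IND, 64, 1 + 3 + FEAT])
branch2 = make_mlp(rng, [3 + 6 * L_D + FEAT + POSE, 64, 1 + 3])

def two_branch(x_c, pose):
    """x_c: (N, 3) canonical points; pose: (POSE,) per-frame pose vector."""
    # Branch 1 sees only low-frequency encoded coordinates (pose-independent).
    out1 = run_mlp(branch1, positional_encoding(x_c, L_IND))
    s1, c1, feat = out1[:, :1], out1[:, 1:4], out1[:, 4:]
    # Branch 2 sees high-frequency coordinates, branch-1 features, and the pose.
    pose_b = np.broadcast_to(pose, (x_c.shape[0], pose.shape[-1]))
    in2 = np.concatenate([positional_encoding(x_c, L_D), feat, pose_b], axis=-1)
    out2 = run_mlp(branch2, in2)
    s2, c2 = out2[:, :1], out2[:, 1:4]
    # Assumed merge: base output plus pose-dependent residual.
    return s1 + s2, c1 + c2, (s1, c1)
```

Calling `two_branch` on the same points with two different pose vectors returns identical `(s1, c1)` base outputs, which is the frame-consistency property the factorization is meant to provide; only the residual changes with the pose.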

Paper Structure

This paper contains 26 sections, 12 equations, 25 figures, and 8 tables.

Figures (25)

  • Figure 1: Motivation illustration. (a) In canonical space, we separate the per-frame rendering output into a pose-independent component and its pose-dependent counterpart. These two components are modeled with distinct frequency bands, thus yielding smooth base outputs and corresponding high-frequency residuals (see Fig. \ref{fig:mocap-lf_hf} for details). The residual image here is amplified for better visualization. (b) Our frequency-aware factorized strategy improves on state-of-the-art methods in novel view synthesis, novel pose rendering, and human shape reconstruction.
  • Figure 2: Conceptual differences. Given a position $x_c$ in canonical space and conditioned on a pose, Vid2Avatar \cite{guo2023vid2avatar} directly regresses the SDF and appearance values with a uniform frequency band and thus models pose-independent information implicitly (a). In (b), HumanNeRF \cite{weng2022humannerf} and MonoHuman \cite{yu2023monohuman} perform the decomposition in coordinate space, using a low-frequency network (green) to regress a pose-dependent position offset and a high-frequency network (red) to learn pose-independent canonical representations. In comparison, we associate the pose-independent information with low frequencies (green) and its pose-dependent counterpart with high frequencies (red) in output space to preserve multi-scale signals (c). Here $x_c$ is computed by a skeletal deformation \cite{weng2022humannerf, guo2023vid2avatar}.
  • Figure 3: Architecture overview. We compute the canonical coordinate $x_c$ of the query point $x_o$ in observation space by performing the skeletal deformation. Then $x_c$ is fed into two branches with the low-frequency (green) and high-frequency (red) positional encoding for pose-independent ($\{s_1, c_1\}$) and pose-dependent ($\{s_2, c_2\}$) outputs respectively. We input their combinations $\{s, c\}$ to volume rendering to generate images under different view directions and human poses.
  • Figure 4: Frequency constraints. To validate our frequency assumption, we train a set of two-branch models with different $L_{ind}$ and $L_{d}$. For simplicity, we denote a model with $L_{ind}=x$ and $L_{d}=y$ as $[x, y]$. Adhering to our network design, the pose-independent branch outputs the low-frequency base normal map as $[5, 10]_{lf}$, while our full model estimates an output with all frequencies as $[5, 10]_{lf+hf}$. Increasing the frequency of the pose-independent output, denoted as $[8, 10]$, yields grainier geometric patterns in $[8, 10]_{lf}$ but prevents the full model from generating sharp pose-dependent wrinkles in $[8, 10]_{lf+hf}$. Training only the pose-dependent branch ($[10]$ with $L_{d}=10$) fails to synthesize the desired multi-scale patterns. See Sec. \ref{sec:s_field} and Sec. \ref{sec:supp_abtest} in the appendix for more discussion.
  • Figure 5: Novel pose rendering on YouTube sequences. While the baselines distort the marked arms with floating noise, our method yields more visually appealing body outlines. We also improve on Vid2Avatar with more realistic textures such as cloth buttons.
  • ...and 20 more figures
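
The Figure 3 caption states that the combined outputs $\{s, c\}$ are passed to volume rendering, and the TL;DR describes that rendering as SDF-based. The exact SDF-to-density conversion is not given in this excerpt, so the sketch below assumes a VolSDF-style Laplace-CDF density with a hypothetical scale `beta`, followed by standard alpha compositing along one ray. The function names and the sphere test scene are illustrative only.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Assumed VolSDF-style mapping: density = (1/beta) * LaplaceCDF(-sdf)."""
    x = -np.asarray(sdf, dtype=np.float64)
    # Laplace CDF with zero mean and scale beta, written to avoid overflow.
    cdf = 0.5 + 0.5 * np.sign(x) * (1.0 - np.exp(-np.abs(x) / beta))
    return cdf / beta

def render_ray(sdf, rgb, t, beta=0.05):
    """Composite per-sample colors along one ray.

    sdf: (S,) signed distances, rgb: (S, 3) colors, t: (S,) sample depths.
    Returns the rendered color (3,) and the per-sample weights (S,).
    """
    sigma = sdf_to_density(sdf, beta)
    delta = np.append(np.diff(t), t[-1] - t[-2])      # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)              # per-sample opacity
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

# Toy scene: a ray along z through a red sphere of radius 0.5 at the origin.
t = np.linspace(0.0, 4.0, 128)
z = -2.0 + t
sphere_sdf = np.abs(z) - 0.5
red = np.tile(np.array([1.0, 0.0, 0.0]), (t.size, 1))
color, weights = render_ray(sphere_sdf, red, t)
```

In this toy scene the weights concentrate around the first surface crossing at depth 1.5, so the rendered color is dominated by samples near the visible surface. In the full model, `sphere_sdf` and `red` would be replaced by the merged outputs $s$ and $c$ evaluated at the canonical coordinates of the ray samples.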