SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering

Tao Hu, Fangzhou Hong, Ziwei Liu

TL;DR

SurMo tackles the challenge of rendering time-varying clothed humans from sparse multi-view videos by introducing a surface-based 4D motion representation encoded as a motion triplane on the body UV surface. A motion encoder lifts static pose and dynamics into a UVH space, while a motion decoder enforces physics-aware learning by predicting next-timestep surface normals $\mathbf{N^{uv}_{t+1}}$ and velocities $\mathbf{V^{uv}_{t+1}}$, all feeding a surface-conditioned renderer. The method achieves state-of-the-art results on the ZJU-MoCap, MPII-RDDC, and AIST++ datasets, producing high-fidelity, view-consistent garments that exhibit fast-motion wrinkles and motion-dependent shadows with efficient rendering. By coupling a topology-guided surface representation with physically informed motion learning and a two-stage rendering pipeline, SurMo provides a practical and scalable platform for high-quality dynamic human rendering in applications such as AR/VR and telepresence.

Abstract

Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However, existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/

Paper Structure (22 sections, 13 equations, 13 figures, 11 tables)


Figures (13)

  • Figure 1: Given several sparse multi-view video sequences with estimated 3D body meshes, SurMo synthesizes subject-specific appearance. We specifically focus on the synthesis of plausible time-varying appearances by learning an effective 4D motion representation.
  • Figure 2: Framework overview. Given a set of time-varying 3D body meshes $\{\mathbf{P_t}, \ldots, \mathbf{P_{t-n}}\}$ obtained from training video sequences, we aim to synthesize high-fidelity appearances of a clothed human in motion via a feature encoder-decoder framework: Motion Encoding, and joint Motion and Appearance Decoding. 1) We take as input an expressive 4D motion representation extracted from the mesh sequences, including 3D pose, 3D velocity at time $t$, and motion trajectory over the past $w$ timesteps. These inputs encode both spatial and temporal relations of the motion sequence and are projected to the spatially aligned UV surface space. A motion encoder $\mathcal{E_M}$ is employed to lift the 2D UV-aligned features to a 3D surface-based triplane $\mathbf{f^{uvh}_t}$ in a UV-plus-height space with a signed distance height to model temporal clothing offsets. 2) A motion decoder $\mathcal{D_M}$ is designed to encourage physical motion learning in training by decoding the triplane features $\mathbf{f^{uvh}_t}$ to predict the motion at the next timestep $t+1$, i.e., the spatial derivative (surface normal $\mathbf{N^{uv}_{t+1}}$) and the temporal derivative (surface velocity $\mathbf{V^{uv}_{t+1}}$) in UV space. 3) Finally, given a target camera view, the triplane $\mathbf{f^{uvh}_t}$ is rendered into high-quality images by a volumetric surface-conditioned renderer that combines volumetric low-resolution rendering by $\mathcal{G}_1$ with an efficient geometry-aware super-resolution by $\mathcal{G}_2$.
  • Figure 3: Qualitative comparisons on novel view synthesis on subject S313 of the ZJU-MoCap dataset. Two motion sequences, S1 (swing arms left to right) and S2 (raise and lower arms), are shown. We specifically focus on the synthesis of time-varying appearances (especially T-shirt wrinkles) by evaluating the rendering results under similar poses yet with different movement directions, which are marked in the same color, such as the pairs ①②, ③④, and ⑤⑥. Our method synthesizes high-fidelity time-varying appearances, whereas the state-of-the-art HumanNeRF generates almost the same cloth wrinkles.
  • Figure 4: Qualitative comparisons on novel view synthesis on subjects S387 and S315 of the ZJU-MoCap dataset. Rows 1 and 2 show similar poses occurring at different timesteps (not consecutive frames). The results indicate that our method synthesizes time-varying appearances while other methods mainly generate pose-dependent appearances.
  • Figure 5: Novel view synthesis of time-varying appearances with both pose and lighting conditioning on the MPII-RDDC dataset. The sequence is captured in a studio with top-down lighting that casts shadows on the human performer due to self-occlusion. In Row 1, we specifically focus on synthesizing time-varying shadows (e.g., ① vs. ②, and ③ vs. ④) for different poses with different self-occlusions. In Row 2, we evaluate the synthesis of: 1) time-varying appearances for similar poses occurring in a jump-up-and-down motion sequence, e.g., ⑤ vs. ⑥, 2) shadows ⑦ vs. ⑧, and 3) clothing offsets ⑤ vs. ⑥.
  • ...and 8 more figures
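
The Figure 2 caption lists three per-timestep motion inputs: 3D pose, 3D velocity at time t, and a motion trajectory over the past w timesteps. As a rough illustration only, the sketch below derives such quantities from a sequence of mesh vertex positions. It is not the authors' implementation: the function names are hypothetical, the velocity is assumed to be a backward finite difference with a unit timestep, the trajectory is assumed to be the displacement from each of the past w frames to the current one, and the projection to UV space is omitted.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # one 3D position per mesh vertex


def sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def velocity(frames: List[Frame], t: int) -> List[Vec3]:
    """Per-vertex backward difference P_t - P_{t-1} (unit timestep assumed)."""
    if t < 1:
        raise ValueError("velocity needs at least one earlier frame")
    return [sub(p, q) for p, q in zip(frames[t], frames[t - 1])]


def trajectory(frames: List[Frame], t: int, w: int) -> List[List[Vec3]]:
    """Per-vertex displacements P_t - P_{t-k} for k = 1..w (assumed definition)."""
    if w < 1 or t < w:
        raise ValueError("trajectory needs w >= 1 and at least w earlier frames")
    return [
        [sub(frames[t][v], frames[t - k][v]) for k in range(1, w + 1)]
        for v in range(len(frames[t]))
    ]


def motion_features(frames: List[Frame], t: int, w: int) -> Dict[str, list]:
    """Bundle pose, velocity, and trajectory at timestep t."""
    return {
        "pose": list(frames[t]),
        "velocity": velocity(frames, t),
        "trajectory": trajectory(frames, t, w),
    }


if __name__ == "__main__":
    # A single vertex moving along the x axis over three frames.
    demo = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
    print(motion_features(demo, t=2, w=2))
```

Under the same assumptions, the next-timestep velocity that the motion decoder is trained to predict would correspond to `velocity(frames, t + 1)`, which is only available during training when the following frame is observed.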