SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering
Tao Hu, Fangzhou Hong, Ziwei Liu
TL;DR
SurMo tackles the challenge of rendering time-varying clothed humans from sparse multi-view videos by introducing a surface-based 4D motion representation encoded as a motion triplane on the body UV surface. A motion encoder lifts static pose and dynamics into a UVH space, while a motion decoder enforces physics-aware learning by predicting next-timestep surface normals $\mathbf{N}^{uv}_{t+1}$ and velocities $\mathbf{V}^{uv}_{t+1}$, all feeding a surface-conditioned renderer. The method achieves state-of-the-art results on ZJU-MoCap, MPII-RDDC, and AIST++ datasets, producing high-fidelity, view-consistent garments that exhibit fast-motion wrinkles and motion-dependent shadows with efficient rendering. By coupling a topology-guided surface representation with physically informed motion learning and a two-stage rendering pipeline, SurMo provides a practical and scalable platform for high-quality dynamic human rendering in applications such as AR/VR and telepresence.
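To make the idea of a motion triplane in UVH space more concrete, the following is a minimal NumPy sketch of how a feature could be looked up for a query point $(u, v, h)$. It is an illustration under stated assumptions, not the authors' implementation: the function names `bilinear_sample` and `query_triplane`, the normalization of coordinates to $[0, 1]$, and the choice to concatenate (rather than sum) the three plane features are all hypothetical.

```python
import numpy as np

def bilinear_sample(plane, x, y):
    """Bilinearly sample an (H, W, C) feature plane at N points.

    x indexes columns and y indexes rows; both are 1-D arrays in [0, 1].
    Returns an (N, C) array of interpolated features.
    """
    h, w, _ = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    px = np.clip(x, 0.0, 1.0) * (w - 1)
    py = np.clip(y, 0.0, 1.0) * (h - 1)
    x0 = np.floor(px).astype(int)
    y0 = np.floor(py).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Interpolation weights, broadcast over the channel axis.
    wx = (px - x0)[:, None]
    wy = (py - y0)[:, None]
    top = plane[y0, x0] * (1 - wx) + plane[y0, x1] * wx
    bottom = plane[y1, x0] * (1 - wx) + plane[y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def query_triplane(plane_uv, plane_uh, plane_vh, uvh):
    """Look up features for (N, 3) query points given as (u, v, h).

    Assumption: the three per-plane features are concatenated.
    """
    u, v, h = uvh[:, 0], uvh[:, 1], uvh[:, 2]
    f_uv = bilinear_sample(plane_uv, u, v)
    f_uh = bilinear_sample(plane_uh, u, h)
    f_vh = bilinear_sample(plane_vh, v, h)
    return np.concatenate([f_uv, f_uh, f_vh], axis=-1)
```

In this sketch, $u$ and $v$ locate a point on the body surface and $h$ is an offset from that surface, so a point near the body is described by three 2D lookups instead of one lookup in a dense 3D volume.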
Abstract
Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However, existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template, which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo/
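The spatial and temporal derivatives mentioned in the second design can be illustrated with a small NumPy sketch. It assumes, purely for illustration, that the body surface at each timestep is available as a UV position map of shape (H, W, 3); the function names, the use of finite differences, and the `dt` parameter are hypothetical and not taken from the paper. Normals are derived from spatial derivatives of the position map, and velocities from the temporal difference between consecutive frames.

```python
import numpy as np

def surface_normals(pos_uv, eps=1e-8):
    """Unit normals of an (H, W, 3) UV position map.

    The normal is the normalized cross product of the spatial
    derivatives along u (columns) and v (rows). Its sign depends on
    the UV orientation convention, which is an assumption here.
    """
    d_v = np.gradient(pos_uv, axis=0)  # finite difference along rows (v)
    d_u = np.gradient(pos_uv, axis=1)  # finite difference along columns (u)
    n = np.cross(d_u, d_v)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, eps)

def surface_velocity(pos_prev, pos_next, dt=1.0):
    """Per-texel velocity as a forward difference between two frames."""
    return (pos_next - pos_prev) / dt

# Toy example: a flat patch in the z = 0 plane that moves along +z.
rows, cols = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
pos_t = np.stack([cols, rows, np.zeros_like(cols)], axis=-1).astype(float)
pos_t1 = pos_t + np.array([0.0, 0.0, 0.5])

normals_t1 = surface_normals(pos_t1)         # stands in for N^{uv}_{t+1}
velocity_t1 = surface_velocity(pos_t, pos_t1, dt=0.5)  # stands in for V^{uv}_{t+1}
```

Maps of this kind could serve as supervision targets for a decoder that predicts the next-timestep normals and velocities from the motion triplane at timestep t; how the actual targets are obtained in SurMo is not specified in the abstract.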
