Ani3DHuman: Photorealistic 3D Human Animation with Self-guided Stochastic Sampling

Qi Sun, Can Wang, Jiaxiang Shang, Yingchun Liu, Jing Liao

TL;DR

A novel self-guided stochastic sampling method is proposed, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity).

Abstract

Current 3D human animation methods struggle to achieve photorealism: kinematics-based approaches lack non-rigid dynamics (e.g., clothing dynamics), while methods that leverage video diffusion priors can synthesize non-rigid motion but suffer from quality artifacts and identity loss. To overcome these limitations, we present Ani3DHuman, a framework that marries kinematics-based animation with video diffusion priors. We first introduce a layered motion representation that disentangles rigid motion from residual non-rigid motion. Rigid motion is generated by a kinematic method, which then produces a coarse rendering to guide the video diffusion model in generating video sequences that restore the residual non-rigid motion. However, this restoration task, based on diffusion sampling, is highly challenging, as the initial renderings are out-of-distribution, causing standard deterministic ODE samplers to fail. Therefore, we propose a novel self-guided stochastic sampling method, which effectively addresses the out-of-distribution problem by combining stochastic sampling (for photorealistic quality) with self-guidance (for identity fidelity). These restored videos provide high-quality supervision, enabling the optimization of the residual non-rigid motion field. Extensive experiments demonstrate that Ani3DHuman can generate photorealistic 3D human animation, outperforming existing methods. Code is available at https://github.com/qiisun/ani3dhuman.
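To make the contrast between deterministic ODE sampling and stochastic sampling concrete, the following is a minimal 1D toy sketch, not the paper's sampler. It assumes a Gaussian data distribution with an analytic score, a variance-exploding noise schedule, and a hand-picked Langevin step size; the names `score`, `sample`, and `langevin_steps` are hypothetical. Starting far off the noisy marginal, the deterministic path stays far from the data, while the added Langevin correction pulls samples back toward the marginal.

```python
import numpy as np

MU, S = 0.0, 1.0  # toy data distribution N(MU, S^2); an assumption for illustration


def score(x, sigma):
    """Exact score of the noisy marginal p_sigma = N(MU, S^2 + sigma^2)."""
    return -(x - MU) / (S**2 + sigma**2)


def sample(x_start, sigmas, langevin_steps, rng):
    """Euler sampler over decreasing noise levels with optional Langevin correction."""
    x = np.array(x_start, dtype=float)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        var = S**2 + sigma**2
        # Stochastic part: Langevin steps at the current noise level.
        # They leave p_sigma (approximately) invariant and contract
        # off-distribution samples toward it.
        for _ in range(langevin_steps):
            eps = 0.1 * var  # hand-picked step size (assumption)
            noise = rng.standard_normal(x.shape)
            x = x + eps * score(x, sigma) + np.sqrt(2.0 * eps) * noise
        # Deterministic part: Euler step of the probability-flow ODE
        # dx/dsigma = -sigma * score(x, sigma).
        x = x + (sigma_next - sigma) * (-sigma * score(x, sigma))
    return x


rng = np.random.default_rng(0)
sigmas = np.geomspace(10.0, 0.01, 50)

# Out-of-distribution start: 10 standard deviations off the marginal at sigma_max.
x_start = np.full(2000, MU + 10.0 * np.sqrt(S**2 + sigmas[0] ** 2))

x_ode = sample(x_start, sigmas, langevin_steps=0, rng=rng)  # deterministic
x_sde = sample(x_start, sigmas, langevin_steps=5, rng=rng)  # stochastic

print("ODE mean:", x_ode.mean())
print("SDE mean and std:", x_sde.mean(), x_sde.std())
```

In this toy the score is exact, so the ODE failure comes purely from the deterministic map preserving how far the start lies from the marginal; with a learned model, inaccurate predictions on out-of-distribution inputs would add to this effect.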

Paper Structure

This paper contains 77 sections, 3 theorems, 19 equations, 17 figures, 5 tables, 1 algorithm.

Key Result

Proposition B.1

Consider the score approximation $\nabla_{\mathbf{x}_t} \log p(\mathbf{y} | \mathbf{x}_t) \approx \nabla_{\mathbf{x}_t} \log p(\mathbf{y} | \hat{\mathbf{x}}_{0|t})$ used in Eq. (10). Let $\mathcal{M}$ be the measurement operator and $\hat{\mathbf{x}}_{0|t} = \mathbb{E}[\mathbf{x}_0 | \mathbf{x}_t]$ be [...] where $C$ is a constant related to the Lipschitz property of the noise schedule.
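The approximation above can be checked numerically in a toy setting where everything is available in closed form. The sketch below is not the paper's proposition or its bound: it assumes a 1D standard Gaussian prior on $\mathbf{x}_0$, $\mathbf{x}_t = \mathbf{x}_0 + \sigma \epsilon$, an identity measurement operator, and Gaussian measurement noise with standard deviation `tau` (all hypothetical choices). Under these assumptions the relative error of the approximate gradient is $\sigma^2 / ((1+\sigma^2)\tau^2)$, which shrinks as the noise level decreases.

```python
import numpy as np


def posterior_mean(x_t, sigma):
    """E[x_0 | x_t] for x_0 ~ N(0, 1) and x_t = x_0 + sigma * eps."""
    return x_t / (1.0 + sigma**2)


def exact_grad(x_t, y, sigma, tau):
    """Exact d/dx_t log p(y | x_t); y | x_t is Gaussian with the variance below."""
    var = sigma**2 / (1.0 + sigma**2) + tau**2
    return (y - posterior_mean(x_t, sigma)) / var / (1.0 + sigma**2)


def approx_grad(x_t, y, sigma, tau):
    """Approximation: d/dx_t log p(y | x0_hat), with x0_hat the posterior mean."""
    return (y - posterior_mean(x_t, sigma)) / tau**2 / (1.0 + sigma**2)


def relative_error(sigma, tau, x_t=0.7, y=0.2):
    g = exact_grad(x_t, y, sigma, tau)
    g_hat = approx_grad(x_t, y, sigma, tau)
    return abs(g_hat - g) / abs(g)


tau = 0.5
errors = [relative_error(sigma, tau) for sigma in (1.0, 0.3, 0.1)]
print(errors)  # decreases as sigma decreases
```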

Figures (17)

  • Figure 1: Given a reference human image and a target SMPL mesh sequence, our method synthesizes photorealistic 3D human animation. Unlike previous state-of-the-art (SOTA) methods that are limited to rigid motion (e.g., LHM qiu2025LHM, top-right), our Ani3DHuman (bottom) also generates high-fidelity non-rigid dynamics, capturing the natural flow of the dress.
  • Figure 2: Pipeline overview. Our Ani3DHuman animates a 3D Gaussian $\mathcal{G}$ (reconstructed with LHM qiu2025LHM from the reference image) with a mesh sequence. Our layered motion combines a mesh-rigged motion with a residual field for non-rigid dynamics. A coarse rendering ${\bm{y}}$ from the rigid motion is restored to a high-quality video ${\bm{x}}^*$ using our self-guided stochastic sampling. This restored video ${\bm{x}}^*$ then provides supervision to progressively optimize the residual motion field.
  • Figure 3: Distribution mismatch in deterministic flow matching. Our degraded input ${\bm{y}}$ (out-of-distribution, OOD) creates a noisy latent ${\bm{x}}_t$ that is off the marginal distribution $p_t({\bm{x}})$. A deterministic Flow-ODE (orange path) follows an incorrect trajectory as its velocity predictions are inaccurate for OOD samples, resulting in a low-quality sample. This motivates our use of an SDE sampler, which can actively correct the path by driving the sample back toward the marginal distribution.
  • Figure 4: Diagonal view-time sampling. (a) Illustration of diagonal sampling in a view-time matrix ($N_\text{traj}=3$). This method simultaneously evolves the camera view and time, distinct from fixed-time (bullet-time) or fixed-camera (independent-view) sampling. (b) An example trajectory shows the camera orbiting 360° as time progresses.
  • Figure 5: Comparison with state-of-the-art methods. Our method (Ours) is the only one to simultaneously achieve high quality, identity preservation, and realistic non-rigid motion. Existing methods fail in key areas: Disco4D pang2025disco4d and SV4D 2.0 yao2024sv4d2 suffer from low quality (due to SDS and multi-view video diffusion); PERSONA loses identity (due to direct reconstruction from pose-driven video diffusion); and LHM qiu2025LHM captures identity but fails to model clothing dynamics. (* self-implementation)
  • ...and 12 more figures
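
The diagonal view-time sampling described in Figure 4 can be sketched as an index computation over the view-time matrix. This is an illustrative sketch only: the function name `diagonal_view_time`, the evenly spaced starting azimuths, and the linear 360° orbit are assumptions, since the caption specifies only that view and time advance together and that $N_\text{traj}=3$ in the illustration.

```python
import numpy as np


def diagonal_view_time(num_frames, num_traj=3, orbit_deg=360.0):
    """Camera azimuth (degrees) for each trajectory and frame.

    Each trajectory advances the camera and the time index together,
    tracing a diagonal in the view-time matrix. Starting azimuths are
    spread evenly over the orbit (an assumption for this sketch).
    """
    t = np.arange(num_frames)
    sweep = orbit_deg * t / num_frames                 # camera motion over time
    offsets = orbit_deg * np.arange(num_traj) / num_traj  # per-trajectory start
    return (offsets[:, None] + sweep[None, :]) % orbit_deg


azimuths = diagonal_view_time(num_frames=8, num_traj=3)
print(azimuths.shape)  # (3, 8): one row per trajectory, one column per frame

# For contrast (also illustrative): fixed-time sampling varies only the view
# at a single frame, and fixed-camera sampling varies only the frame at a
# single view.
fixed_time_views = np.linspace(0.0, 360.0, 8, endpoint=False)
fixed_camera_view = np.zeros(8)
```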

Theorems & Definitions (3)

  • Proposition B.1: Error Bound of Gradient Approximation
  • Proposition B.2: SDE Correction Mechanism (Karras2022edm)
  • Proposition B.3: Equivalence of Stochastic Term