MotionDreamer: Exploring Semantic Video Diffusion features for Zero-Shot 3D Mesh Animation

Lukas Uzolas, Elmar Eisemann, Petr Kellnhofer

TL;DR

MotionDreamer addresses the challenge of re-animating unseen 3D meshes without target-domain training by leveraging semantic features from pre-trained video diffusion models (VDMs) to guide pose fitting on explicit mesh representations. The approach textures a given mesh, conditions a VDM on a rendering of that mesh to generate a video with motion, and optimizes the mesh animation parameters by matching semantic diffusion features across frames. Evaluations across two VDM backbones and four animation models show favorable motion quality in a user study and competitive pose-fitting accuracy with reduced runtime compared to end-to-end 4D methods. The work enables fast, zero-shot re-animation of diverse assets within standard graphics pipelines and opens avenues for diffusion-guided motion analysis and asset authoring.

Abstract

Animation techniques bring digital 3D worlds and characters to life. However, manual animation is tedious, and automated techniques are often specialized to narrow shape classes. In our work, we propose a technique for automatic re-animation of various 3D shapes based on a motion prior extracted from a video diffusion model. Unlike existing 4D generation methods, we focus solely on the motion, and we leverage an explicit mesh-based representation compatible with existing computer-graphics pipelines. Furthermore, our utilization of diffusion features enhances the accuracy of our motion fitting. We analyze the efficacy of these features for animation fitting, and we experimentally validate our approach for two different diffusion models and four animation models. Finally, we demonstrate in a user study that our time-efficient zero-shot method achieves superior performance in re-animating a diverse set of 3D shapes when compared to existing techniques. The project website is located at https://lukas.uzolas.com/MotionDreamer.

Paper Structure (58 sections, 7 equations, 21 figures, 8 tables)

Figures (21)

  • Figure 1: Our zero-shot 3D mesh animations. From top to bottom: the desired motion description, the resulting animated mesh with motion contours, and the driving video from a pre-trained video diffusion model. Notice the robustness of our method to the temporal identity shift (a) and to the geometric distortions (b). Diverse shapes are supported through a range of animation models including a) FLAME [FLAME:SiggraphAsia2017], b) Neural Jacobian Fields [aigerman2022njf], and c) SMAL [zuffi20173d].
  • Figure 2: A diagram of our method. First, we automatically texture the input mesh $\mathcal{M}$ to reduce the domain gap to the VDM prior (Sec. \ref{sec:scene_init}). Second, we condition the VDM on a rendered image $\mathbf{I}_{rgb}$ to produce a video with motion and to extract features $\hat{\mathbf{A}}$ for all $L$ frames from its internal U-Net (Sec. \ref{sec:motion_generation}). Finally, we reproject the input frame features $\hat{\mathbf{A}}^0$ onto the mesh surface and optimize the mesh animation parameters $\mathbf{p}$ to match the reposed mesh features to the video (Sec. \ref{sec:optimization}).
  • Figure 3: A qualitative comparison of our method to DG4D and MDM-MT for the prompts and the shapes used in our study. We display two untextured views of the last frame with one additional textured image for reference. The contours convey the motion trajectory.
  • Figure 4: Left: Results of the user study, asking the question: "Which video... ?" For the first three questions we compare our method against untextured renders of DG4D and MDM-MT. For the last question we compare against the full RGB outputs of DG4D and the VDM output. *** denotes significance at $p < 0.001$ (bars show 95% confidence intervals). Right: Pose fitting errors for $\mathbf{A}^{\hat{t}}_u$ extracted across U-Net layers $u$ with bars showing standard deviations.
  • Figure 5: Left: PA-MPJPE $\downarrow$ with a standard deviation range for features $\mathbf{A}^t_{\hat{u}}$ extracted for different diffusion steps $t$. Right: Depth regularization prevents undesirable motion-in-depth explanations.
  • ...and 16 more figures
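
The caption of Figure 5 reports PA-MPJPE (lower is better) without defining it. As a minimal sketch, the snippet below computes the metric in its commonly used form: the predicted joints are aligned to the ground truth with the best similarity transform (scale, rotation, translation), and the mean Euclidean joint error is taken afterwards. The function name `pa_mpjpe` and the `(J, 3)` array layout are assumptions made for illustration, and whether the paper uses exactly this variant is also an assumption.

```python
import numpy as np


def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-joint position error.

    pred, gt: arrays of shape (J, 3) holding J joint positions.
    Hypothetical helper for illustration; not taken from the paper.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # Remove translation by centering both joint sets.
    mu_pred = pred.mean(axis=0)
    mu_gt = gt.mean(axis=0)
    p = pred - mu_pred
    g = gt - mu_gt

    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    h = p.T @ g
    u, s, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    corr = np.diag([1.0, 1.0, d])
    rot = vt.T @ corr @ u.T

    # Optimal isotropic scale for the centered prediction.
    scale = (s * np.diag(corr)).sum() / (p ** 2).sum()

    # Apply the similarity transform and average the per-joint distances.
    aligned = scale * p @ rot.T + mu_gt
    return np.linalg.norm(aligned - gt, axis=1).mean()
```

Because the alignment removes global scale, rotation, and translation, the metric only measures differences in articulated pose, which is why a prediction that is merely a rotated and rescaled copy of the ground truth scores approximately zero.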