MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion

Roy Kapon, Guy Tevet, Daniel Cohen-Or, Amit H. Bermano

TL;DR

MAS tackles the challenge of 3D motion generation without direct 3D data by learning a 2D diffusion prior from in-the-wild videos and performing multi-view ancestral sampling. At each step, multiple 2D views of the same motion are denoised jointly, triangulated into a coherent 3D sequence, and reprojected back to views to guide subsequent steps, yielding diverse and realistic 3D motions. The approach achieves strong results in domains with scarce 3D data (NBA, horses, rhythmic gymnastics) with fast inference on a single GPU and shows robustness against typical diffusion-based sampling failures. This work broadens the reachable motion domains from monocular video and offers a scalable alternative to reliance on curated 3D motion datasets, potentially benefiting animation, robotics, and AR/VR applications.

Abstract

We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation, using 2D diffusion models that were trained on motions obtained from in-the-wild videos. As such, MAS opens opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence, and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastic performances featuring a ball apparatus, and horse races. In each of these domains, 3D motion capture is arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/
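
The consistency step described above (combine the per-view generations into one 3D sequence, then project it back to every view) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes orthographic cameras that differ only by azimuth and uses closed-form least squares for triangulation. The function names (`orthographic_camera`, `triangulate`, `project`, `consistency_block`) and the array layout (views, frames, joints, coordinates) are hypothetical.

```python
import numpy as np


def orthographic_camera(azimuth: float) -> np.ndarray:
    """Assumed camera model: 2x3 orthographic projection, rotated about the y axis."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    # The two rows are the image axes of the camera; they are orthonormal.
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0]])


def project(motion_3d: np.ndarray, cameras: np.ndarray) -> np.ndarray:
    """Project a (F, J, 3) motion onto V views, giving (V, F, J, 2)."""
    return np.einsum("vck,fjk->vfjc", cameras, motion_3d)


def triangulate(views_2d: np.ndarray, cameras: np.ndarray) -> np.ndarray:
    """Least-squares 3D motion (F, J, 3) from per-view 2D motions (V, F, J, 2)."""
    V, F, J, _ = views_2d.shape
    A = cameras.reshape(2 * V, 3)  # stacked projection rows
    # Stack observations in the same (view, image-axis) row order as A.
    b = views_2d.transpose(0, 3, 1, 2).reshape(2 * V, F * J)
    X, *_ = np.linalg.lstsq(A, b, rcond=None)  # shape (3, F * J)
    return X.T.reshape(F, J, 3)


def consistency_block(views_2d: np.ndarray, cameras: np.ndarray):
    """Fuse per-view predictions into one 3D motion and reproject it to all views."""
    motion_3d = triangulate(views_2d, cameras)
    return motion_3d, project(motion_3d, cameras)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    azimuths = np.linspace(0.0, 2.0 * np.pi, 5, endpoint=False)
    cameras = np.stack([orthographic_camera(a) for a in azimuths])
    # Stand-in for per-view model predictions: random, mutually inconsistent 2D motions.
    predictions = rng.normal(size=(5, 4, 3, 2))
    fused, consistent = consistency_block(predictions, cameras)
    print(fused.shape, consistent.shape)
```

After this step every view is an exact projection of the same 3D sequence, so applying the block a second time leaves the views unchanged. A perspective camera model would need an iterative solver in place of the closed-form least squares.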

Paper Structure (18 sections, 2 theorems, 11 equations, 6 figures, 6 tables)

This paper contains 18 sections, 2 theorems, 11 equations, 6 figures, 6 tables.

Key Result

Theorem 1

Let $\varepsilon\in\mathbb{R}^{3}$ with $\varepsilon\sim \mathcal{N}\left(0,I_{3\times3}\right)$, and let $P\in \mathbb{R}^{2\times 3}$ be an orthogonal projection matrix. Then $P\cdot\varepsilon\sim\mathcal{N}\left(0,I_{2\times2}\right)$.
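
The statement follows from the standard rule for linear maps of Gaussian vectors. A short derivation, assuming that "orthogonal projection matrix" means the two rows of $P$ are orthonormal, so that $PP^{\top}=I_{2\times2}$:

```latex
\begin{align*}
\mathbb{E}\left[P\varepsilon\right] &= P\,\mathbb{E}\left[\varepsilon\right] = 0, \\
\operatorname{Cov}\left(P\varepsilon\right)
  &= P\,\operatorname{Cov}\left(\varepsilon\right)P^{\top}
   = P\,I_{3\times3}\,P^{\top}
   = PP^{\top}
   = I_{2\times2}.
\end{align*}
```

Since a linear map of a Gaussian vector is again Gaussian, $P\cdot\varepsilon\sim\mathcal{N}\left(0,I_{2\times2}\right)$. Under this reading, projecting a single 3D noise sample gives every view 2D noise with the standard normal distribution that a 2D diffusion model expects.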

Figures (6)

  • Figure 1: 3D motions generated by Multi-view Ancestral Sampling (MAS) --- each one using a different initial noise. Our method generates novel 3D motions using a 2D diffusion model. As such, it enables learning intricate 3D motion synthesis solely from monocular video data.
  • Figure 2: Preparations. The motion diffusion model used for MAS is trained on 2D motion estimations of videos scraped from the web.
  • Figure 3: Overview of MAS, showing a multi-view denoising step from the 2D sample collection $x_t^{1:V}$ to $x_{t-1}^{1:V}$, corresponding to camera views $v_{1:V}$. Denoising is performed by a fixed 2D motion diffusion model $G_{2D}$. At each such iteration, our Consistency Block triangulates the motion predictions $\hat{x}_0^{1:V}$ into a single 3D sequence and projects it back onto each view ($\tilde{x}_0^{1:V}$). To encourage consistency in the model's predictions, we sample 3D noise $\epsilon_{3D}$ and project it to the 2D noise $\epsilon^{v}$ for each view. Finally, we sample $x_{t-1}^{1:V}$ from $q\left(x^{1:V}_{t-1}|x^{1:V}_t,\tilde{x}_0^{1:V}\right)$.
  • Figure 4: Motions generated by MAS compared to ElePose (Wandt et al., 2021), MotionBert (Zhu et al., 2023), and an adaptation of DreamFusion (Poole et al., 2022) to unconditioned motion generation. We observe that MotionBert and DreamFusion produce dull motions with limited movement, while ElePose predictions are jittery and often include invalid poses (red rectangles).
  • Figure 5: NBA dataset user study. We asked $22$ unique users to compare $15$ randomly generated motions from each model to MAS generations in $3$ aspects: precision (i.e., which samples best depict basketball moves), overall quality, and diversity. The dashed line marks $50\%$. MAS outperforms the lifting methods and the DreamFusion adaptation.
  • ...and 1 more figures
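
The last part of the multi-view step in the Figure 3 caption (project one 3D noise sample to every view, then sample $x_{t-1}^{1:V}$ from the posterior) can be sketched as below. This is a hedged illustration rather than the paper's code: it assumes orthographic cameras, the standard DDPM Gaussian posterior, and a zero-based timestep index into a hypothetical `betas` schedule. All function names are illustrative.

```python
import numpy as np


def orthographic_camera(azimuth: float) -> np.ndarray:
    """Assumed camera model: 2x3 orthographic projection, rotated about the y axis."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0]])


def project_noise(eps_3d: np.ndarray, cameras: np.ndarray) -> np.ndarray:
    """Project 3D noise (F, J, 3) to per-view 2D noise (V, F, J, 2)."""
    return np.einsum("vck,fjk->vfjc", cameras, eps_3d)


def posterior_sample(x_t, x0_tilde, eps, t, betas):
    """Draw x_{t-1} from the standard DDPM posterior q(x_{t-1} | x_t, x0_tilde).

    `eps` is the noise to use, with the same shape as x_t; `t` indexes `betas`
    from 0 (assumed convention). At t == 0 the variance is zero.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    coef_x0 = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    var = betas[t] * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return coef_x0 * x0_tilde + coef_xt * x_t + np.sqrt(var) * eps


def multiview_step(x_t, x0_tilde, cameras, t, betas, rng):
    """One sampling step for all views (V, F, J, 2), sharing a single 3D noise sample."""
    eps_3d = rng.standard_normal(x_t.shape[1:3] + (3,))  # one noise vector per frame and joint
    eps_2d = project_noise(eps_3d, cameras)              # consistent 2D noise for every view
    return posterior_sample(x_t, x0_tilde, eps_2d, t, betas)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cameras = np.stack([orthographic_camera(a) for a in (0.0, 0.7, 2.1)])
    betas = np.linspace(1e-4, 0.02, 10)          # hypothetical noise schedule
    x_t = rng.standard_normal((3, 4, 2, 2))      # V=3 views, F=4 frames, J=2 joints
    x0_tilde = rng.standard_normal((3, 4, 2, 2)) # stand-in for reprojected predictions
    print(multiview_step(x_t, x0_tilde, cameras, 5, betas, rng).shape)
```

Because every view's noise is a projection of the same 3D sample, the views stay projections of a common 3D noise field, while each view's noise is still standard normal by Theorem 1.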

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof