
ViMo: Generating Motions from Casual Videos

Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

TL;DR

ViMo addresses the challenge of generating diverse, realistic 3D human motions from casual videos by introducing a diffusion-based video-to-motion framework conditioned on multi-view 2D poses, avoiding explicit camera pose estimation. The method uses a transformer-diffusion network with FiLM conditioning and classifier-free guidance, optimized with joint, velocity, and foot-contact losses to ensure temporal coherence and physical plausibility. It demonstrates three practical applications: large-scale 3D motion dataset construction via diffusion, few-shot dancing stylization from videos, and video-guided motion completion. It also outperforms MotionBERT in both quantitative metrics and user studies. The work enables scalable motion generation from ubiquitous video content, with potential to expand motion datasets, enable style transfer, and support motion editing in real-world pipelines.
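
The two conditioning mechanisms named above, FiLM and classifier-free guidance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `film` and `classifier_free_guidance`, the flat-list feature representation, and the particular guidance formula (the common "unconditional plus scaled difference" form) are assumptions made for illustration only.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation (FiLM).

    Each feature channel h is scaled and shifted by values derived from the
    condition: h' = gamma * h + beta. In a real network gamma and beta would
    come from a learned projection of the 2D-pose condition; here they are
    plain lists supplied by the caller (an assumption for this sketch).
    """
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def classifier_free_guidance(pred_cond, pred_uncond, scale):
    """Blend conditional and unconditional predictions at sampling time.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


if __name__ == "__main__":
    hidden = [1.0, 2.0]
    print(film(hidden, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
    print(classifier_free_guidance([1.0, 2.0], [0.0, 0.0], scale=2.0))
```

With classifier-free guidance, the same network is typically trained with the condition randomly dropped, so that one model can supply both predictions at inference time; whether ViMo follows exactly this recipe is not stated in this summary.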

Abstract

Although humans have the innate ability to imagine multiple possible actions from videos, doing so remains an extraordinary challenge for computers because of intricate camera movements and montages. Most existing motion generation methods rely predominantly on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or multi-view cameras, which unavoidably limits dataset size and severely undermines generalizability. Inspired by recent advances in diffusion models, we probe a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) that can leverage the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos can be more casual, including complicated camera movements and occlusions. Experimental results demonstrate that the proposed model can generate natural motions even for videos with rapid movements, varying perspectives, or frequent occlusions. We also show that this work enables three important downstream applications, such as generating dancing motions according to arbitrary music and source video style. Extensive experimental results show that our model offers an effective and scalable way to generate diverse and realistic motions. Code and demos will be public soon.

Paper Structure

This paper contains 27 sections, 8 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: An example of a casual video, here a game advertisement. The camera's perspective and distance from the actor change constantly, and some joints are occluded. The characters at the bottom show the 3D motion clips generated by our method, ViMo, which can extract comparable 3D motions from casual videos. Traditional human pose reconstruction methods often fail to obtain plausible motions on such casual videos, where the camera motion is complex and the characters are occluded from beginning to end.
  • Figure 2: Diagram of our ViMo pipeline. ViMo takes a sequence of 2D poses as the input conditional signal $c$. It then runs a denoising process from time $t = T$ to $t = 0$ to obtain a 3D motion sequence. Note that the motion itself is the prediction at each denoising step. One advantage is that the diffusion process is robust on casual videos and can generate the corresponding 3D motions without estimating precise camera positions.
  • Figure 3: Constructed Chinese classical dance dataset. ViMo can help build high-quality motion data, which can then benefit downstream tasks, as our other applications illustrate.
  • Figure 4: Dance motion stylization given arbitrary music. Given a few reference videos, ViMo can conveniently extract motions and feed them to music-to-dance models to learn motions in the corresponding style. This makes it possible to generate motion in one style, kungfu for instance, from a totally different kind of music such as pop music.
  • Figure 5: Qualitative motion results for different methods. The proposed ViMo generates more plausible motions than the other methods.
  • ...and 1 more figure
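
The denoising process in the Figure 2 caption, where the network predicts the motion itself at every step from $t = T$ down to $t = 0$, can be sketched as a standard DDPM sampling loop with clean-sample prediction. Everything specific here is an assumption for illustration: the linear noise schedule, the posterior update, the flat-list motion representation, and `toy_denoiser`, which is a hypothetical stand-in for the trained transformer (it simply lifts each 2D joint to 3D with zero depth). The summary does not specify the sampler ViMo actually uses.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of alpha = 1 - beta."""
    betas = [beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(x_t, t, cond_2d):
    """Hypothetical stand-in for the trained network.

    Ignores the noisy input and lifts each 2D joint (u, v) to (u, v, 0).
    A real model would predict the clean 3D motion from x_t, t and cond_2d.
    """
    out = []
    for u, v in cond_2d:
        out.extend([u, v, 0.0])
    return out


def sample_motion(denoiser, cond_2d, num_steps=50, seed=0):
    """Run the reverse process from pure noise to a 3D motion vector."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(3 * len(cond_2d))]  # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        x0_hat = denoiser(x, t, cond_2d)  # the motion itself is predicted
        beta_t, ab_t = betas[t], alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Gaussian posterior q(x_{t-1} | x_t, x0_hat) of the forward process.
        coef_x0 = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
        coef_xt = math.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
        sigma = math.sqrt(beta_t * (1.0 - ab_prev) / (1.0 - ab_t))
        x = [coef_x0 * a + coef_xt * b + sigma * rng.gauss(0.0, 1.0)
             for a, b in zip(x0_hat, x)]
    return x


if __name__ == "__main__":
    poses_2d = [(0.1, 0.2), (0.3, -0.4)]  # two joints of one frame
    print(sample_motion(toy_denoiser, poses_2d, num_steps=10))
```

Note that the loop never touches camera parameters: the 2D poses enter only as the condition passed to the denoiser, which matches the caption's point that no precise camera positions are estimated.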