ViMo: Generating Motions from Casual Videos
Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han
TL;DR
ViMo addresses the challenge of generating diverse, realistic 3D human motions from casual videos by introducing a diffusion-based video-to-motion framework conditioned on multi-view 2D poses, which avoids explicit camera pose estimation. The method uses a transformer-based diffusion network with FiLM conditioning and classifier-free guidance, trained with joint, velocity, and foot-contact losses to encourage temporal coherence and physical plausibility. It demonstrates three practical applications: large-scale 3D motion dataset construction, few-shot dance stylization from videos, and video-guided motion completion. It outperforms MotionBERT in both quantitative metrics and user studies. The work enables scalable motion generation from ubiquitous video content, with the potential to expand motion datasets, enable style transfer, and support motion editing in real-world pipelines.
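As a rough illustration of two ingredients named in the summary, the sketch below shows the standard classifier-free guidance combination and one plausible form of joint, velocity, and foot-contact losses. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`cfg_combine`, `motion_losses`), the tensor layout `(frames, joints, 3)`, the mean-squared-error form of each term, and the binary contact mask are all hypothetical choices made for illustration.

```python
import numpy as np


def cfg_combine(pred_uncond, pred_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a guidance scale (hypothetical helper)."""
    return pred_uncond + scale * (pred_cond - pred_uncond)


def motion_losses(pred, target, foot_idx, contact):
    """Assumed loss forms; the paper's exact definitions may differ.

    pred, target: (T, J, 3) arrays of 3D joint positions over T frames.
    foot_idx:     list of joint indices treated as feet (assumption).
    contact:      (T-1, len(foot_idx)) binary mask, 1 where a foot is
                  assumed to be in contact with the ground.
    """
    # Joint loss: mean squared error on joint positions.
    joint = np.mean((pred - target) ** 2)

    # Velocity loss: mean squared error on frame-to-frame differences.
    pred_vel = pred[1:] - pred[:-1]
    target_vel = target[1:] - target[:-1]
    vel = np.mean((pred_vel - target_vel) ** 2)

    # Foot-contact loss: penalize foot motion while the foot is in contact.
    foot_vel = pred_vel[:, foot_idx]              # (T-1, F, 3)
    foot = np.mean(contact[..., None] * foot_vel ** 2)

    return joint, vel, foot
```

In this sketch a guidance scale of 0 returns the unconditional prediction and a scale of 1 returns the conditional one; larger scales extrapolate further toward the condition. The three loss terms would typically be summed with weights, which are left out here because the source does not state them.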
Abstract
Although humans have the innate ability to imagine multiple possible actions from videos, this remains an extraordinary challenge for computers because of intricate camera movements and montages. Most existing motion generation methods rely on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or multi-view cameras, which unavoidably limits their size and severely undermines their generalizability. Inspired by recent advances in diffusion models, we explore a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) that can leverage the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos can be more casual, including complicated camera movements and occlusions. Experimental results demonstrate that the proposed model can generate natural motions even for videos with rapid movements, varying perspectives, or frequent occlusions. We also show that this work enables three important downstream applications, for example generating dancing motions according to arbitrary music and source video style. Extensive experimental results show that our model offers an effective and scalable way to generate diverse and realistic motions. Code and demos will be made public soon.
