MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations

Liang Xu, Shaoyang Hua, Zili Lin, Yifan Liu, Feipeng Ma, Yichao Yan, Xin Jin, Xiaokang Yang, Wenjun Zeng

TL;DR

The paper introduces MotionBank, a large-scale, in-the-wild video motion benchmark with 1.24M motion sequences and 132.9M frames, annotated with automatically generated, rule-based, disentangled posecode captions to improve motion-text alignment. The authors build MotionBank by extracting SMPL parameters from 13 video datasets and generating captions automatically from posecodes, which supports benchmarks for motion generation, motion in-context generation, and motion understanding. A large motion model framework combining motion quantization (VQ-VAE), motion-language pre-training, and adaptive tuning demonstrates the dataset’s utility: it shows improved diversity and downstream task performance (e.g., on HumanML3D and BEHAVE) and enables controllable motion generation from rule-based text. The work positions MotionBank as a practical, scalable alternative to MoCap-centric data for building open-vocabulary, context-aware motion models, with a planned public release of data, code, and benchmarks.

Abstract

In this paper, we tackle the problem of how to build and benchmark a large motion model (LMM). The ultimate goal of an LMM is to serve as a foundation model for versatile motion-related tasks, e.g., human motion generation, with interpretability and generalizability. Though advanced, recent LMM-related works are still limited by small-scale motion data and costly text descriptions. Besides, previous motion benchmarks primarily focus on pure body movements, neglecting the ubiquitous motions in context, i.e., humans interacting with humans, objects, and scenes. To address these limitations, we consolidate large-scale video action datasets as knowledge banks to build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions. Unlike laboratory-captured motions, in-the-wild human-centric videos contain abundant motions in context. To facilitate better motion-text alignment, we also devise a motion caption generation algorithm that automatically produces rule-based, unbiased, and disentangled text descriptions from the kinematic characteristics of each motion. Extensive experiments show that MotionBank is beneficial for the general motion-related tasks of human motion generation, motion in-context generation, and motion understanding. Video motions together with the rule-based text annotations could serve as an efficient alternative for larger LMMs. Our dataset, code, and benchmark will be publicly available at https://github.com/liangxuy/MotionBank.
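
To make the idea of a rule-based, kinematics-driven caption concrete, here is a minimal sketch and not the authors' algorithm. It computes a knee angle from three 3D joint positions, bins it into a discrete category (a "posecode" in the sense used above), and renders one text fragment per body part so that each part is described independently. The joint names, angle thresholds, and wording are all assumptions chosen for illustration; none of them come from the paper.

```python
import math

def joint_angle(a, b, c):
    """Return the angle in degrees at joint b formed by the 3D points a-b-c."""
    ba = [a[i] - b[i] for i in range(3)]
    bc = [c[i] - b[i] for i in range(3)]
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0.0:
        raise ValueError("degenerate joint configuration: coincident points")
    cos = sum(x * y for x, y in zip(ba, bc)) / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Hypothetical upper bounds (degrees) and labels; not taken from the paper.
KNEE_BINS = [
    (60.0, "completely bent"),
    (120.0, "partially bent"),
    (160.0, "slightly bent"),
]

def knee_posecode(angle_deg):
    """Map a continuous knee angle to a discrete, human-readable category."""
    for upper, label in KNEE_BINS:
        if angle_deg < upper:
            return label
    return "straight"

def caption_frame(joints):
    """Describe each knee separately, one fragment per body part."""
    parts = []
    for side in ("left", "right"):
        angle = joint_angle(
            joints[f"{side}_hip"], joints[f"{side}_knee"], joints[f"{side}_ankle"]
        )
        parts.append(f"the {side} knee is {knee_posecode(angle)}")
    return "; ".join(parts)

# Toy single-frame example: left leg straight, right knee at a right angle.
frame = {
    "left_hip": (0.0, 1.0, 0.0),
    "left_knee": (0.0, 0.5, 0.0),
    "left_ankle": (0.0, 0.0, 0.0),
    "right_hip": (0.0, 1.0, 0.0),
    "right_knee": (0.0, 0.5, 0.0),
    "right_ankle": (0.5, 0.5, 0.0),
}
print(caption_frame(frame))
```

Because every fragment is derived from a fixed rule on a single body part, the output is deterministic and needs no manual annotation. A full system would also need rules for temporal changes across frames, which this single-frame sketch leaves out.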

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of how MotionBank differs from previous datasets. (a) Previous datasets are collected with optical/inertial motion capture (MoCap) systems or multi-view cameras, and rely on manual text annotations. (b) For MotionBank, we collect a large number of publicly available in-the-wild human-centric videos and extract human motion from them. We also devise an algorithm that automatically generates rule-based, fine-grained, and disentangled motion captions as the corresponding text annotations.
  • Figure 2: Visualization of semantic space distributions between Motion-X and MotionBank.
  • Figure 3: Visualization of the video motion data of MotionBank. The 13 source video action datasets contain abundant 1) in-the-wild daily activities; 2) sports; 3) natural and diverse human motions in context, i.e., human-human, human-object, and human-scene interactions.
  • Figure 4: The pipeline of text-motion alignment. (a) Previous methods adopt a direct mapping between motions and human-like texts. (b) Our MotionBank takes the rule-based texts as a bridge to narrow the gap between motions and human-like texts via fine-tuning.
  • Figure 5: Examples of the generated motion captions. In this example from MultiSports (Li et al., 2021), the man steps back from a half-squat pose. Our generated results correctly capture the dynamics and semantics.
  • ...and 2 more figures