Diffusion Model-based Activity Completion for AI Motion Capture from Videos

Gao Huayu, Huang Tengjiu, Ye Xiaolong, Tsuyoshi Okita

TL;DR

This work tackles the limitation of AI motion capture systems that are confined to observed sequences by introducing diffusion-model-based action completion for virtual humans. The proposed MDC-Net delivers seamless transitions between motion fragments and supports arbitrary-length output, aided by a gate module and a position-time embedding that improve temporal coherence. Evaluations on Human3.6M show competitive ADE, FDE, and MMADE performance with a smaller footprint than baselines, and the approach includes a pipeline to derive IMU-like sensor data from generated motions. The method promises cost-effective, flexible MoCap for interactive applications, while acknowledging remaining challenges in enforcing physical plausibility and mesh-to-skeleton accuracy.

Abstract

AI-based motion capture is an emerging technology that offers a cost-effective alternative to traditional motion capture systems. However, current AI motion capture methods rely entirely on observed video sequences, similar to conventional motion capture. This means that all human actions must be predefined, and movements outside the observed sequences are not possible. To address this limitation, we aim to apply AI motion capture to virtual humans, where flexible actions beyond the observed sequences are required. We assume that while many action fragments exist in the training data, the transitions between them may be missing. To bridge these gaps, we propose a diffusion-model-based action completion technique that generates complementary human motion sequences, ensuring smooth and continuous movements. By introducing a gate module and a position-time embedding module, our approach achieves competitive results on the Human3.6M dataset. Our experimental results show that (1) MDC-Net outperforms existing methods in ADE, FDE, and MMADE but is slightly less accurate in MMFDE, (2) MDC-Net has a smaller model size (16.84M) compared to HumanMAC (28.40M), and (3) MDC-Net generates more natural and coherent motion sequences. Additionally, we propose a method for extracting sensor data, including acceleration and angular velocity, from human motion sequences.
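For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of ADE and FDE as they are commonly defined in stochastic human motion prediction: ADE is the L2 distance between prediction and ground truth averaged over all frames, FDE is that distance at the last frame, and both take the best value over the generated samples. The evaluation protocol of the paper is not given in this section, so the function name `ade_fde`, the array shapes, and the best-of-samples convention are assumptions rather than the authors' implementation.

```python
import numpy as np


def ade_fde(pred, gt):
    """Best-of-samples ADE and FDE (assumed convention, not taken from the paper).

    pred: array of shape (S, T, D), S sampled motions of T frames,
          each frame a flattened pose vector of size D.
    gt:   array of shape (T, D), the ground-truth motion.
    """
    # Per-sample, per-frame L2 distance over the flattened pose vector.
    dist = np.linalg.norm(pred - gt[None], axis=2)  # shape (S, T)
    ade = dist.mean(axis=1).min()  # average over frames, best sample
    fde = dist[:, -1].min()        # last frame only, best sample
    return float(ade), float(fde)


# Tiny hand-checkable example: 2 samples, 4 frames, 6-dimensional poses.
gt = np.zeros((4, 6))
pred = np.zeros((2, 4, 6))
pred[0] = 1.0        # sample 0 is off by 1 in every coordinate
pred[1, -1, 0] = 3.0  # sample 1 is exact except for the last frame
ade, fde = ade_fde(pred, gt)
```

In this example the two metrics are achieved by different samples: sample 1 gives the lowest ADE, while sample 0 gives the lowest FDE.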

Paper Structure

This paper contains 18 sections, 4 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Human Motion Completion. $H1$ and $H2$ are two human motions that can either be different or the same. Using a generative model and inference, we produce an intermediate motion sequence, $P$, to connect and complete these two motions.
  • Figure 2: Flowchart of MDC-Net. The input data are embedded into the DCT domain, and a mask selects the required part of these sequences.
  • Figure 3: Different padding strategies. We conducted experiments on $P$ using four strategies, shown from the first to the fourth row of the figure: (1) filling $P$ with the last frame of $H1$ and the first frame of $H2$, respectively; (2) setting all elements of $P$ to zero; (3) filling all elements of $P$ with the last frame of $H1$; (4) filling all elements of $P$ with the first frame of $H2$.
  • Figure 4: Mask. The gray segment represents the sequences after padding, while the black segment represents the noise sequence $P$. $H1 = \{X_{n-m+1}, \ldots, X_n\}$ and $H2 = \{Y_1, \ldots, Y_k\}$ are the motion sequences that are input into the model. Multiplying the matrix $M$ with the gray sequences extracts the initial motion sequences. Multiplying $1-M$ with the black sequence then extracts the sequence that needs to be generated. Finally, adding these two parts together yields the complete sequence.
  • Figure 5: Baseline. In the figure, nframes represents the total number of frames, n, that are input into the model. Similarly, nfeats represents the number of features, covering the keypoints and their xyz coordinates.
  • ...and 11 more figures
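
The padding strategies of Figure 3 and the mask composition of Figure 4 can be sketched as follows. This is an illustrative sketch only: the function names `pad_gap` and `complete`, the strategy labels, the array shapes, and the half-and-half split used for the first strategy are assumptions, and the sketch works directly on frames, whereas Figure 2 indicates that MDC-Net operates in the DCT domain.

```python
import numpy as np


def pad_gap(h1, h2, gap, strategy="split"):
    """Build a placeholder for the missing segment P (hypothetical helper).

    h1: (m, D) frames of the first motion; h2: (k, D) frames of the second.
    Returns an array of shape (gap, D).
    """
    if strategy == "split":
        # Strategy 1: last frame of H1, then first frame of H2.
        # Splitting the gap in half is an assumption of this sketch.
        half = gap // 2
        first = np.repeat(h1[-1:], half, axis=0)
        second = np.repeat(h2[:1], gap - half, axis=0)
        return np.concatenate([first, second], axis=0)
    if strategy == "zero":
        # Strategy 2: all elements of P set to zero.
        return np.zeros((gap, h1.shape[1]), dtype=h1.dtype)
    if strategy == "last_h1":
        # Strategy 3: every frame of P is the last frame of H1.
        return np.repeat(h1[-1:], gap, axis=0)
    if strategy == "first_h2":
        # Strategy 4: every frame of P is the first frame of H2.
        return np.repeat(h2[:1], gap, axis=0)
    raise ValueError(f"unknown strategy: {strategy}")


def complete(h1, h2, candidate, strategy="split"):
    """Combine observed and generated frames as M * padded + (1 - M) * candidate.

    candidate: (m + gap + k, D) full-length sequence (noise or a model output).
    """
    m, k = h1.shape[0], h2.shape[0]
    gap = candidate.shape[0] - m - k
    padded = np.concatenate([h1, pad_gap(h1, h2, gap, strategy), h2], axis=0)
    # M is 1 on observed frames and 0 on the frames to be generated.
    mask = np.ones((candidate.shape[0], 1))
    mask[m:m + gap] = 0.0
    return mask * padded + (1.0 - mask) * candidate


# Toy example: H1 has 3 frames of ones, H2 has 2 frames of twos, gap of 4 frames.
h1 = np.ones((3, 2))
h2 = 2.0 * np.ones((2, 2))
candidate = np.full((9, 2), 9.0)
result = complete(h1, h2, candidate)
```

In the toy example the observed frames of $H1$ and $H2$ pass through unchanged, and only the four masked frames in the middle come from `candidate`, so the choice of padding strategy affects the model input but not the frames kept by the mask.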