
ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling

Deok-Kyeong Jang, Dongseok Yang, Deok-Yun Jang, Byeoli Choi, Donghoon Shin, Sung-hee Lee

TL;DR

ELMO tackles the problem of recovering high-fidelity 3D motion in real time from a single LiDAR sensor operating at 20 fps. It introduces a conditional autoregressive transformer-based upsampling framework that leverages motion and point-cloud embeddings, a motion prior, and a novel LiDAR-based data augmentation to achieve 60 fps mocap with low latency. A one-time skeleton calibration model and a synthetic LiDAR simulator enable robust personalization and broad generalization across diverse subjects and environments. Comprehensive quantitative and qualitative evaluations show substantial performance gains over state-of-the-art methods, and real-time demonstrations in Unity illustrate practical applicability for streaming and interactive gaming.

Abstract

This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality. To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles. ELMO's fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research. The dataset and evaluation code are available at https://movin3d.github.io/ELMO_SIGASIA2024/
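
The frame-rate arithmetic behind the 20 fps to 60 fps claim can be made concrete with a small sketch. Only the two frame rates come from the abstract. The function names, and the assumption that each LiDAR frame opens an interval covered by a fixed number of output poses, are illustrative and not taken from the paper; how ELMO actually aligns generated poses with input frames is not specified in this section.

```python
from fractions import Fraction

LIDAR_FPS = 20   # input rate stated in the abstract
MOCAP_FPS = 60   # output rate stated in the abstract


def upsampling_ratio(input_fps: int, output_fps: int) -> int:
    """Number of output poses per input frame (hypothetical helper)."""
    if output_fps % input_fps != 0:
        raise ValueError("output rate must be an integer multiple of input rate")
    return output_fps // input_fps


def output_frame_times(lidar_index: int,
                       input_fps: int = LIDAR_FPS,
                       output_fps: int = MOCAP_FPS) -> list[Fraction]:
    """Timestamps (in seconds) of the output poses that fall in the
    interval starting at the given LiDAR frame.

    Assumption: output poses are evenly spaced and the first one
    coincides with the LiDAR frame itself.
    """
    ratio = upsampling_ratio(input_fps, output_fps)
    first = lidar_index * ratio
    return [Fraction(first + k, output_fps) for k in range(ratio)]


if __name__ == "__main__":
    print(upsampling_ratio(LIDAR_FPS, MOCAP_FPS))
    print(output_frame_times(1))
```

Under these assumptions, every point cloud has to account for three output poses, which is why the generator is described as an upsampler rather than a per-frame pose estimator.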

Paper Structure

This paper contains 27 sections, 7 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Overall network architectures. (a) Detail of the feature extraction pipeline. (b) Overview of the generator for real-time upsampling LiDAR motion capture at run time.
  • Figure 2: Constructing the motion prior in the training phase.
  • Figure 3: Top: Snapshot of our LiDAR simulator. Red dots represent collision points between the simulated lasers and the body mesh animated with the augmented motion clips. Bottom: Augmentation results using mirroring and simulation for 90°, 180°, and 270° global rotations. The yellow character represents the original data, while the blue characters represent the augmented data.
  • Figure 4: Samples of random SMPL body meshes in A-poses with corresponding skeletons and simulated point clouds.
  • Figure 5: Samples of offline outputs of ablation models. From left to right: ELMO with future frame input and data augmentation (Yellow), only with future frame input (Blue), and the baseline (Pink).
  • ...and 7 more figures
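
The mirroring and global-rotation augmentation described in the Figure 3 caption can be sketched on joint positions as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the Y-up convention, mirroring by negating the x coordinate, and all function names are hypothetical. It transforms positions only; joint rotations and left/right joint relabeling are omitted. Per the caption, the matching point clouds are produced by re-running the LiDAR simulator on the augmented motion, which this sketch does not do.

```python
import math

Point = tuple[float, float, float]


def rotate_about_vertical(points: list[Point], degrees: float) -> list[Point]:
    """Rotate positions about the vertical axis (assumed to be Y)."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]


def mirror_left_right(points: list[Point]) -> list[Point]:
    """Mirror positions across the YZ plane (assumed left/right axis is X)."""
    return [(-x, y, z) for x, y, z in points]


def augment(points: list[Point],
            angles: tuple[float, ...] = (90, 180, 270)) -> dict:
    """Return the original and mirrored poses at 0 degrees plus each angle.

    The angles follow the Figure 3 caption; combining mirroring with
    every rotation is an assumption made for this sketch.
    """
    variants = {}
    sources = (("original", points), ("mirrored", mirror_left_right(points)))
    for name, source in sources:
        for angle in (0, *angles):
            variants[(name, angle)] = rotate_about_vertical(source, angle)
    return variants


if __name__ == "__main__":
    pose = [(1.0, 0.9, 0.0), (0.2, 1.5, 0.3)]  # made-up joint positions
    for key, value in augment(pose).items():
        print(key, value)
```

Rotating the motion and then simulating the scan, rather than rotating a recorded point cloud directly, matters because a single sensor sees a different side of the body at each orientation.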