RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method

Ming Yan; Yan Zhang; Shuqiang Cai; Shuqi Fan; Xincheng Lin; Yudi Dai; Siqi Shen; Chenglu Wen; Lan Xu; Yuexin Ma; Cheng Wang

TL;DR

RELI11D tackles the challenge of holistic human pose estimation by providing a first-of-its-kind multimodal dataset that fuses RGB, LiDAR, IMU, and Event modalities across 10 actors, 5 sports, and 7 scenes, totaling 3.32 hours of synchronized data. The dataset is complemented by a multimodal baseline, LEIR, which uses cross-attention fusion to integrate geometry, appearance, and motion dynamics for global pose and trajectory estimation. The authors demonstrate that multimodal fusion significantly improves HPE performance, especially for rapid and complex motions, and provide extensive benchmarks across multiple HPE tasks. This work lays the groundwork for robust, scene-aware human motion understanding and offers a valuable resource for future multimodal HPE research and applications.

Abstract

Comprehensive capture of human motion requires both accurate capture of complex poses and precise localization of the human within the scene. Most human pose estimation (HPE) datasets and methods rely primarily on RGB, LiDAR, or IMU data. However, these modalities, alone or in combination, may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset involving LiDAR, an IMU system, an RGB camera, and an Event camera. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurement data, RGB videos, and Event streams. Through extensive experiments, we demonstrate that RELI11D presents considerable challenges and opportunities, as it contains many rapid and complex motions that require precise localization. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes LiDAR point clouds, Event streams, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for both rapid and daily motions, and that utilizing the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field.
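
The abstract names cross-attention as the fusion mechanism but does not give its details here. As a rough illustration of the general idea only, the sketch below implements generic scaled dot-product cross-attention in plain Python, where tokens from one modality act as queries over tokens from another. The function names, the toy feature values, and the choice of LiDAR tokens querying Event tokens are assumptions made for illustration; this is not LEIR's actual architecture, and the learned projection matrices a real model would use are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative, not LEIR).

    queries: n_q vectors of length d from one modality
    keys:    n_k vectors of length d from another modality
    values:  n_k vectors of length d_v from that other modality
    Returns n_q fused vectors of length d_v.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical toy features: 2 point-cloud tokens query 3 event tokens.
lidar_tokens = [[1.0, 0.0], [0.0, 1.0]]
event_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(lidar_tokens, event_tokens, event_tokens)
```

Each output vector is a convex combination of the value vectors, so a query token draws most heavily on the tokens of the other modality that it most resembles.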

Paper Structure (21 sections, 2 equations, 7 figures, 7 tables)

Introduction
Related Work
Single modality Datasets and Methods
Multi-modality Datasets and Methods
RELI11D: a multimodal motion dataset
Hardware and Configuration
Data Annotation Pipeline
Multimodal Data Pre-processing Stage
Consolidated Optimization
Manual Annotation and Verification Stage
LEIR: A multimodal HPE baseline
Feature Extraction
Temporal unified multimodal model (TUMM)
SMPL-Based inverse motion solver
Experimental Results
...and 6 more sections

Figures (7)

Figure 1: RELI11D is a high-quality dataset that provides four different modalities and records movement actions (first two rows). Our dataset's annotation pipeline can provide accurate global SMPL joints and poses, as well as global human motion trajectories (last row).
Figure 2: RELI11D provides rich data and annotations: (a) RGB Videos, (b) 2D Annotation, (c) 2D SMPL Poses, (d) Events, (e) 3D Point Clouds, (f) 2D Point Clouds, (g) High Precision Scene Meshes, (h) 3D SMPL Shape, Poses, and Trajectories, (i) IMU Measurements.
Figure 3: Portable Human Motion Capturing system.
Figure 4: Overview of the main annotation pipeline. The dotted boxes of different colors represent different data processing stages, and the arrows represent the data flow direction. Dotted box: The input of each scene sequence consists of RGB videos, point cloud sequences, IMU measurements, event flow (time axis), and 3D laser scanning data. The data pre-processing stage calibrates and synchronizes the different modalities. The consolidated optimization refines the global pose and translation based on multiple constraint losses.
Figure 5: Overview of the LEIR method (Left) and the Multimodal Cross-Attention Unit (Right). Orange arrows represent the different modalities of data input. Dark blue arrows represent the input and output data flows of the TUMM model. Dotted arrows represent the predicted data and the loss calculated against the ground truth.
...and 2 more figures
