Motion Capture from Inertial and Vision Sensors

Xiaodong Chen; Wu Liu; Qian Bao; Xinchen Liu; Ruoli Dai; Yongdong Zhang; Tao Mei

TL;DR

The paper addresses accurate motion capture with consumer devices by fusing monocular video with sparse IMUs. It introduces MINIONS, a large-scale multi-modal dataset with 5.5M frames, 146 actions, and 36 actors, annotated with 2D/3D joints, SMPL parameters, and textures, and collected with eight 2K cameras, 17 IMUs, and an RGB-D scanner. The core method, SparseNet, is a two-branch Bayesian fusion framework that uses a Matrix Fisher prior over $SO(3)$ for joint rotations and a von Mises–Fisher model for bone directions. It integrates visual and inertial cues into a posterior $p(\theta \mid d_{v}, D_{I})$ to recover full-body pose, and estimates global translation with a Perspective-n-Point (PnP) step. Experiments on MINIONS and TotalCapture show that combining 4–6 IMUs with a monocular camera yields stable, drift-free motion capture, whereas camera-only and IMU-only setups suffer from jitter or drift; using more than eight IMUs yields diminishing returns. The dataset also supports downstream tasks such as 2D-to-3D pose estimation and fine-grained action recognition, which makes multi-modal motion capture more practical and accessible for daily-life applications.
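To make the Bayesian fusion idea concrete, here is a minimal NumPy sketch of how two Matrix Fisher beliefs over one joint rotation can be combined. It is not the paper's SparseNet implementation. It only relies on a general property of the Matrix Fisher density $p(R) \propto \exp(\mathrm{tr}(F^{\top} R))$: the product of two such densities is again Matrix Fisher with parameter $F_1 + F_2$, and its mode follows from a proper SVD of $F$. The function names, the helper `rot_z`, and the concentration values 2.0 and 20.0 are assumptions chosen for illustration.

```python
import numpy as np


def matrix_fisher_mode(F):
    """Mode of p(R) proportional to exp(tr(F^T R)) over SO(3), via a proper SVD."""
    U, _, Vt = np.linalg.svd(F)
    # Correct the last axis so the result is a rotation (det = +1), not a reflection.
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt


def fuse(F_a, F_b):
    """Multiplying two Matrix Fisher densities adds their parameter matrices."""
    return F_a + F_b


def rot_z(angle):
    """Rotation by `angle` radians about the z axis (illustrative helper)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Hypothetical inputs: a weakly concentrated visual belief centred at 0 rad
# and a strongly concentrated inertial belief centred at 0.5 rad.
F_vision = 2.0 * rot_z(0.0)
F_imu = 20.0 * rot_z(0.5)

R_fused = matrix_fisher_mode(fuse(F_vision, F_imu))
angle = np.arctan2(R_fused[1, 0], R_fused[0, 0])
print(f"fused z-rotation: {angle:.3f} rad")
```

In this toy setup the fused rotation lies between the two inputs and closer to the more concentrated one, which is the behaviour one would expect from a posterior that weighs each sensor by its confidence.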

Abstract

Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) a large scale of over five million frames and 400 minutes in duration; 2) multi-modal data of IMU signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their complementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experimental results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

Paper Structure (16 sections, 7 equations, 8 figures, 5 tables)

Introduction
Related Work
The MINIONS Dataset
Multimedia Hardware Setup
Calibration
Textured Mesh Reconstruction
Global Motion Annotations
Dataset Statistics
Multi-modal Human Motion Capture
Theoretical Assumptions
Network Structure
Experiments
Implementation Details
Multi-modal Human Motion Capture
Benchmarks on other Tasks
...and 1 more sections

Figures (8)

Figure 1: Overview of our MINIONS dataset. It is collected by multiple types of sensors including eight 2K-resolution RGB cameras, Inertial Measurement Units (IMUs), and an RGB-D scanner. With the multi-modal data, we annotate human motion sequences with (d) 2D/3D joints, (e) the SMPL parameters, (f) the texture of each actor from a scanner, and fine-grained action types with textual descriptions.
Figure 2: Overview of Dataset Construction. (a) Input multimedia data collected from multimedia hardware, including the RGB-D scanner, multiple synchronized cameras, and full-body IMU suits; (b) Data pre-processing, including point cloud data conversion, 2D joint detection, 3D joints triangulation, and IMUs data alignment; (c) Textured 3D human models and approximated SMPL shape; and (d) Global motion recovery from inertial and visual results.
Figure 3: Example frame of motion recovery with inertial and visual data.
Figure 4: Fine-grained Actions. MINIONS contains 121 single-player actions and 25 multi-player actions, including common person-person and person-object interactive actions in daily life.
Figure 5: Qualitative results from single-subject motion capture collection. Results of 3D mesh and the corresponding re-projected full-body 2D joints from various views, visualized with white rendering for enhanced clarity.
...and 3 more figures
