Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera

Tutian Tang; Xingyu Ji; Yutong Li; MingHao Liu; Wenqiang Xu; Cewu Lu

TL;DR

Stereo-Inertial Poser is a real-time motion capture system that uses a single stereo camera and six IMUs to estimate metric-accurate, shape-aware 3D human motion. It produces drift-free global translation over long recordings and reduces foot-skating.

Abstract

Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate the occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and by shape-agnostic local motion estimates that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing the monocular RGB camera with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused to predict drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometric variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show that our method produces drift-free global translation over long recordings and reduces foot-skating effects.
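
To make the phrase "resolves depth ambiguity through calibrated baseline geometry" concrete, here is a minimal sketch of classical triangulation for a rectified stereo pair, where depth follows from disparity as Z = f·B/d. This is textbook geometry, not the paper's 3D Pose Module, which the abstract describes only at a high level. The function name `triangulate_rectified`, its parameters, and the numeric values in the usage note are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def triangulate_rectified(
    left_px: Sequence[Point2D],
    right_px: Sequence[Point2D],
    focal_px: float,
    baseline_m: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Back-project matched keypoints from a rectified stereo pair.

    Hypothetical helper: assumes both cameras share the focal length
    `focal_px` (pixels) and principal point (cx, cy), and that the right
    camera is displaced by `baseline_m` metres along the left camera's
    x axis. Returns metric points in the left-camera frame.
    """
    points: List[Point3D] = []
    for (u_left, v_left), (u_right, _v_right) in zip(left_px, right_px):
        # Horizontal pixel shift of the same keypoint between the two views.
        disparity = u_left - u_right
        if disparity <= 0:
            raise ValueError("disparity must be positive for a point in front of the rig")
        # Depth from similar triangles: Z = f * B / d.
        z = focal_px * baseline_m / disparity
        # Pinhole back-projection of the left pixel at that depth.
        x = (u_left - cx) * z / focal_px
        y = (v_left - cy) * z / focal_px
        points.append((x, y, z))
    return points
```

With an assumed focal length of 1000 px, a 0.12 m baseline, and a principal point at (640, 360), a keypoint seen at (700, 400) in the left image and (640, 400) in the right image has a disparity of 60 px and therefore a depth of 2 m. A monocular camera cannot recover this metric scale from a single view, which is the ambiguity the abstract refers to.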

Paper Structure (26 sections, 15 equations, 2 figures, 2 tables)

Introduction
Related Work
Single Modal Human Motion Capture
Vision-Based Methods
Inertial-Based Methods
Human Motion Capture with Visual-Inertial Fusion
Method
Overview
3D Pose and Body Shape Estimation
3D Metric Pose Estimation
Body Shape Estimation
Shape-Aware Visual-Inertial Fusion
Initial Global Translation Estimation
Initial Local Motion Estimation
IMU Encoding Network
...and 11 more sections

Figures (2)

Figure 1: (a) Our method estimates full human motion, including the local motions in the body's root coordinate and the global translation, with six inexpensive wearable IMUs and a stereo camera. (b) Previous methods in this track usually ignore the body shape (i.e., the $\boldsymbol{\beta}$ parameters of the SMPL model), which may lead to the foot-skating effect or incorrect global translation estimation. (c) Qualitative comparison with previous methods (PIP, PNP, RobustCap) shows that the proposed method can estimate metric-accurate, drift-free global translation in the world coordinate.
Figure 2: Our pipeline starts with one stereo camera and six IMUs. From the stereo image pairs, the 3D Pose Module predicts 3D keypoints in the root coordinate $\boldsymbol{p}_R$ and the world coordinate $\boldsymbol{p}_C$. The Body Pose Module estimates the body shape parameters $\boldsymbol{\beta}$ (Sec. \ref{sec:pose_shape}). These initial measurements and predictions are encoded by three separate networks (Sec. \ref{sec:method_fusion}) into initial global translations $\boldsymbol{T}$, velocities $\Delta\boldsymbol{T}$, and 3D joint positions $\boldsymbol{J}_{IMU}$, $\boldsymbol{J}_{VIS}$ in the root coordinate. The FusionNet and RefineNet (Sec. \ref{sec:method_fusion}) fuse these intermediate results and further refine them into metric-accurate, shape-aware final results, including local motions $\boldsymbol{\Phi}$, global translations $\boldsymbol{T}$, and the auxiliary contact probability $\boldsymbol{q}$.
