
Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs

Yiming Bao, Xu Zhao, Dahong Qian

TL;DR

Monocular 3D human pose estimation (HPE) suffers from depth ambiguity and occlusion. This work introduces Real-time Optimization and Fusion (RTOF), which lifts 2D poses from monocular video to 3D, fuses raw IMU data in kinematic space via an Inertial-Guided Inverse Kinematic (IGIK) layer, and applies fragment-based temporal optimization with the energy $E(X^{frag})=k_VE_V+k_IE_I$, where $E_V=\sum_i\sum_j \|P X_{i,j}^{frag}-x_{i,j}\|^2$ and $E_I=k_AE_A+k_BE_B+k_SE_S$, to produce smooth, biomechanically plausible 3D motion. On Total Capture, MPJPE improves from $64.6\,\mathrm{mm}$ to $33.7\,\mathrm{mm}$ with SF and fusion, and to $23.2\,\mathrm{mm}$ with ground-truth (GT) 2D poses, which is competitive with multi-view methods; Human3.6M results confirm strong temporal accuracy for the visual-only case. The approach enables real-time 3D HPE with sparse IMUs that is robust to occlusion and depth ambiguity, with potential applications in outdoor capture, rehabilitation, and action recognition.
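The energy in the summary can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it assumes that $i$ indexes frames and $j$ indexes joints, that $P$ is a linear $2\times 3$ projection (the real projection may be perspective), and that the inertial terms $E_A$, $E_B$, $E_S$, which are not defined in this summary, are supplied as precomputed scalars. The function names and default weights are hypothetical.

```python
import numpy as np

def visual_energy(X_frag, x_2d, P):
    """E_V: summed squared reprojection error over a fragment.

    X_frag: (N, J, 3) 3D joint positions for N frames and J joints.
    x_2d:   (N, J, 2) observed 2D joint positions.
    P:      (2, 3) linear projection matrix (assumption, see lead-in).
    """
    projected = X_frag @ P.T              # (N, J, 2)
    return float(np.sum((projected - x_2d) ** 2))

def total_energy(X_frag, x_2d, P, inertial_terms,
                 k_V=1.0, k_I=1.0, k_A=1.0, k_B=1.0, k_S=1.0):
    """E = k_V * E_V + k_I * E_I, with E_I = k_A*E_A + k_B*E_B + k_S*E_S.

    inertial_terms: tuple (E_A, E_B, E_S) of precomputed scalars; their
    definitions are not given in the summary, so they are inputs here.
    """
    E_A, E_B, E_S = inertial_terms
    E_I = k_A * E_A + k_B * E_B + k_S * E_S
    return k_V * visual_energy(X_frag, x_2d, P) + k_I * E_I

# Toy usage: 2 frames, 3 joints, projection that drops the depth axis.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
X = np.zeros((2, 3, 3))
x = np.ones((2, 3, 2))
E = total_energy(X, x, P, inertial_terms=(1.0, 2.0, 3.0))
```

In practice the weights $k_V$, $k_I$, $k_A$, $k_B$, $k_S$ balance how strongly the optimizer trusts the 2D detections versus the inertial measurements; the values above are placeholders.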

Abstract

Temporal 3D human pose estimation from monocular videos is a challenging task in human-centered computer vision due to the depth ambiguity of 2D-to-3D lifting. To improve accuracy and address occlusion, inertial sensors have been introduced as a complementary source of information. However, it remains challenging to integrate heterogeneous sensor data to produce physically plausible 3D human poses. In this paper, we propose a novel framework, Real-time Optimization and Fusion (RTOF), to address this issue. We first incorporate sparse inertial orientations into a parametric human skeleton to refine 3D poses in kinematic space. The poses are then optimized by energy functions built on both visual and inertial observations to reduce temporal jitter. Our framework outputs smooth and biomechanically plausible human motion. Comprehensive experiments with ablation studies demonstrate its soundness and efficiency. On the Total Capture dataset, the pose estimation error is significantly reduced compared to the baseline method.
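The pose estimation error on Total Capture is reported as MPJPE (mean per-joint position error), the average Euclidean distance between predicted and ground-truth joints. A minimal sketch of the metric follows; the array layout and the helper `relative_reduction` are assumptions made for illustration, not part of the paper.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: arrays of shape (..., J, 3) in the same unit (e.g. mm).
    Returns the mean Euclidean distance over all joints and frames.
    """
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def relative_reduction(before, after):
    """Fractional error reduction from `before` to `after` (hypothetical helper)."""
    return (before - after) / before

# Toy usage: every joint is offset by (3, 4, 0), i.e. 5 units away.
gt = np.zeros((2, 3, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
err = mpjpe(pred, gt)
```

Applied to the reported numbers, `relative_reduction(64.6, 33.7)` gives roughly 0.48, i.e. the error is nearly halved.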

Paper Structure

This paper contains 16 sections, 8 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Overview of the proposed RTOF framework. Three modules run in succession. First, the 2D pose sequence is lifted to 3D space with a receptive field of $n$ frames. Single-frame sensor fusion then refines the lifted 3D pose using calibrated and aligned raw IMU data. Finally, fragments of $N$ frames are cropped from the refined 3D sequence and optimized with both visual and inertial energy functions.
  • Figure 2: Illustration of the proposed fragment cropping approach with $N=4$. In this case, the crop step and the length used for the final average are both $2$. The cropped fragments are optimized by the temporal optimization model and then used to compute the output 3D pose sequence.
  • Figure 3: Estimated 3D human poses (red skeletons) compared with ground truth (green skeletons).
  • Figure 4: Quantitative ablation results on the Total Capture dataset: per-joint position error and rotation angle error curves over all frames of four sequences with different motion characteristics.
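The fragment cropping described for Figure 2 can be sketched as a sliding window followed by averaging of overlapping frames. This is one plausible reading, not the paper's exact procedure: it assumes fragments of length `N` start every `step` frames and that frames covered by several optimized fragments are averaged. The `optimize` callback is a hypothetical stand-in for the temporal optimization model.

```python
import numpy as np

def crop_fragments(seq, N=4, step=2):
    """Return (start, fragment) pairs of length N taken every `step` frames.

    seq: (T, J, 3) refined 3D pose sequence.
    """
    T = len(seq)
    return [(s, seq[s:s + N]) for s in range(0, T - N + 1, step)]

def merge_fragments(fragments, shape):
    """Average overlapping fragments back into one sequence of `shape`.

    Returns the merged sequence and a boolean mask of covered frames.
    """
    acc = np.zeros(shape, dtype=float)
    count = np.zeros(shape[0])
    for start, frag in fragments:
        acc[start:start + len(frag)] += frag
        count[start:start + len(frag)] += 1
    covered = count > 0
    acc[covered] /= count[covered][:, None, None]
    return acc, covered

def process_sequence(seq, optimize=lambda frag: frag, N=4, step=2):
    """Crop, optimize each fragment (placeholder), then merge by averaging."""
    fragments = [(s, optimize(f)) for s, f in crop_fragments(seq, N, step)]
    return merge_fragments(fragments, seq.shape)

# Toy usage: 8 frames, 1 joint; identity "optimizer" reproduces the input.
seq = np.arange(8 * 1 * 3, dtype=float).reshape(8, 1, 3)
starts = [s for s, _ in crop_fragments(seq, N=4, step=2)]
merged, covered = process_sequence(seq)
```

With $N=4$ and a step of $2$, every interior frame is covered by two fragments, so the output for those frames is the mean of two optimized estimates, which is one way overlapping windows can suppress jitter at fragment boundaries.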