MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors

He Zhang; Shenghao Ren; Haolei Yuan; Jianhui Zhao; Fan Li; Shuangpeng Sun; Zhenghao Liang; Tao Yu; Qiu Shen; Xun Cao

TL;DR

MMVP is a vision-pressure multimodal MoCap dataset that pairs RGBD video with dense plantar pressure, providing accurate, dense foot-contact annotations for large-range, fast motions. The paper contributes an RGBD-P SMPL fitting approach that uses both depth and pressure signals to constrain pose, shape, and ground contact, and a monocular baseline, VP-MoCap, that predicts foot pressure and refines pose and translation using ground depth and contact cues. In ground-truth fitting, contact estimation, and pose-translation optimization, the methods outperform vision-only baselines and prior multimodal methods, with more stable global translation and less foot sliding. The dataset and baselines provide synchronized vision and pressure signals with precise contact annotations, and are intended to support MoCap research in AR/VR, biomechanics, and related domains.

Abstract

Foot contact is an important cue for human motion capture, understanding, and generation. Existing datasets tend to annotate dense foot contact either by visual matching with thresholding or by incorporating pressure signals. However, these approaches either suffer from low accuracy or are designed only for small-range, slow motion. There is still no vision-pressure multimodal dataset that covers large-range, fast human motion and provides accurate, dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations, which is especially useful for plausible shape estimation, robust pose fitting without foot drifting, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL fitting results significantly outperform pure visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate research in this direction and provide a good reference for MoCap applications in various domains. Project page: https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.

Paper Structure (27 sections, 12 equations, 11 figures, 5 tables)

Introduction
Related Work
Multimodal MoCap Datasets
Human Pose Estimation
MMVP Dataset
Data Collection and Pre-processing
Calculate Dense Foot Contact
RGBD-P SMPL Fitting
Method: VP-MoCap
FPP-Net
Pose and Translation Optimization
Experiments
Comparison of Ground Truth Registration
Comparison of Foot Contact Estimation
Pose and Translation Optimization Results
...and 12 more sections

Figures (11)

Figure 1: MMVP is a multimodal dataset that provides monocular RGBD video and accurate foot pressure (contact) of large-range and fast human motion.
Figure 2: Comparison of the foot contact and pose annotations in RICH [huang2022rich] (left) and MMVP (right). RICH fits the SMPL model and annotates foot contact by distance thresholding, while MMVP incorporates dense pressure and contact directly, giving much more accurate SMPL fitting results. Note that, because of RICH's vision-only SMPL fitting error, the right foot, which is hanging in the air, was annotated as being in full contact with the ground.
Figure 3: Illustration of the dense foot contact annotation method. From left to right: the reference image, original pressure, normalized pressure, and dense contact.
Figure 4: VP-MoCap pipeline. Given an RGB sequence, RTMPose [jiang2023rtmpose] and CLIFF [li2022cliff] are applied to detect 2D keypoints and regress the initial pose. FPP-Net predicts the foot pressure distribution and dense foot contact from the keypoint sequence. Guided by foot contact, joint optimization is applied to estimate pose and trajectory. (Green represents the learning part, while orange represents the optimization part.)
Figure 5: Qualitative comparison of contact estimation. From left to right, the second row includes: foot pressure distribution predicted by our method, foot contact predicted by our method, and contact predicted by BSTRO[FT].
...and 6 more figures
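
The stages named in the Figure 3 caption (original pressure, normalized pressure, dense contact) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the per-frame peak normalization, the relative threshold `rel_threshold`, and the function name `dense_contact_from_pressure` are all hypothetical.

```python
import numpy as np


def dense_contact_from_pressure(pressure, rel_threshold=0.1):
    """Return (normalized pressure, boolean contact mask) for one pressure frame.

    Assumptions (not from the paper): the frame is normalized by its own peak
    value, and a cell counts as "in contact" when its normalized pressure is at
    least rel_threshold.
    """
    pressure = np.asarray(pressure, dtype=np.float64)
    peak = pressure.max()
    if peak <= 0:
        # No load anywhere on the frame: nothing is in contact.
        return np.zeros_like(pressure), np.zeros(pressure.shape, dtype=bool)
    normalized = pressure / peak          # values now lie in [0, 1]
    contact = normalized >= rel_threshold  # dense per-cell contact mask
    return normalized, contact


# Toy 2x3 pressure frame in arbitrary sensor units (made-up values).
frame = np.array([[0.0, 2.0, 40.0],
                  [0.0, 10.0, 80.0]])
normalized, contact = dense_contact_from_pressure(frame)
print(contact)
```

A relative threshold keeps the mask independent of sensor units, but it also means a lightly loaded foot and a heavily loaded foot produce similar masks; normalizing by body weight or a calibrated maximum would be an alternative design.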
