Towards Unstructured Unlabeled Optical Mocap: A Video Helps!

Nicholas Milef; John Keyser; Shu Kong

TL;DR

This work tackles Unstructured Unlabeled Optical (UUO) mocap, where markers are placed without a fixed template and remain unlabeled. It introduces a three-module pipeline that combines monocular video-derived SMPL priors, marker-part matching, and a staged mocap solver to recover pose, shape, and global motion from UUO markers. The approach uses multi-hypothesis rotation and a sequence of optimization steps to robustly align markers to body parts, identify marker-vertex correspondences, and refine inverse kinematics. Results on three benchmark datasets show substantial improvements over both marker-only and video-only baselines, including strong performance in partial-body reconstruction, indicating practical gains for biomechanics and other applications with flexible marker layouts.
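The marker-vertex correspondence step can be illustrated with a minimal sketch. It assumes, hypothetically, that the video-initialized body model is already posed in the mocap coordinate frame, and it labels each marker with the model vertex that stays closest to it on average across frames. The function name `assign_markers_to_vertices` and this cost are illustrative choices, not the authors' code, and the brute-force search stands in for whatever correspondence procedure the paper actually uses.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def assign_markers_to_vertices(
    markers: Sequence[Sequence[Point]],
    vertices: Sequence[Sequence[Point]],
) -> List[int]:
    """Label each marker with the body-model vertex it stays closest to.

    markers[t][m] is marker m at frame t; vertices[t][v] is vertex v of the
    (hypothetical) video-initialized body model at frame t. The cost of a
    marker-vertex pair is their mean Euclidean distance over all frames.
    """
    n_frames = len(markers)
    labels: List[int] = []
    for m in range(len(markers[0])):
        best_vertex, best_cost = -1, math.inf
        for v in range(len(vertices[0])):
            cost = sum(
                math.dist(markers[t][m], vertices[t][v]) for t in range(n_frames)
            ) / n_frames
            if cost < best_cost:
                best_vertex, best_cost = v, cost
        labels.append(best_vertex)
    return labels


# Two frames of a toy three-vertex "body" that translates along z.
verts = [[(0, 0, 0), (1, 0, 0), (0, 1, 0)],
         [(0, 0, 1), (1, 0, 1), (0, 1, 1)]]
marks = [[(0.9, 0.1, 0), (0.05, 0.95, 0)],
         [(0.9, 0.1, 1), (0.05, 0.95, 1)]]
print(assign_markers_to_vertices(marks, verts))  # [1, 2]
```

Averaging over frames, rather than matching a single frame, makes the label less sensitive to one noisy or occluded frame. For a full body mesh, the inner loop would normally be replaced by a spatial index such as a k-d tree.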

Abstract

Optical motion capture (mocap) requires accurately reconstructing the human body, including pose and shape, from retroreflective markers. In a typical mocap setting, marker labeling is an important but tedious and error-prone step. Previous work has shown that marker labeling can be automated by using a structured template that defines specific marker placements, but this imposes additional recording constraints. We propose to relax these constraints and solve for Unstructured Unlabeled Optical (UUO) mocap. Unlike the typical mocap setting, which either labels markers or places them w.r.t. a structured layout, UUO mocap allows markers to be placed anywhere on the body, or even on a single limb (e.g., the right leg for biomechanics research). This makes it of greater practical significance, but also more challenging. To solve UUO mocap, we exploit a monocular video captured by a single RGB camera, which does not require camera calibration. On this video, we run an off-the-shelf method to reconstruct and track a human individual, which provides strong visual priors on human body pose and shape. With both the video and the UUO markers, we propose an optimization pipeline for marker identification, marker labeling, human pose estimation, and human body reconstruction. Our technical novelties include multiple hypothesis testing to optimize global orientation, and marker localization and marker-part matching to better fit the body surface. We conduct extensive experiments to quantitatively compare our method against state-of-the-art approaches, including marker-only mocap and video-only human body/shape reconstruction. Experiments demonstrate that our method resoundingly outperforms existing methods on three established benchmark datasets for both full-body and partial-body reconstruction.
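Multiple hypothesis testing for global orientation can be sketched as follows, under explicitly hypothetical assumptions: the mocap frame is y-up, the video-derived body model is already in metric units, translation is removed by centering both point sets, and only K evenly spaced yaw angles about the vertical axis are tested, scored by a one-sided nearest-vertex error. The names `best_root_yaw`, `fitting_error`, `centered`, and `rotate_y` are illustrative, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centered(points: Sequence[Point]) -> List[Point]:
    """Subtract the centroid so that only rotation remains to be tested."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in points]


def rotate_y(p: Point, theta: float) -> Point:
    """Rotate a point about the vertical (y) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = p
    return (c * x + s * z, y, -s * x + c * z)


def fitting_error(markers: Sequence[Point], vertices: Sequence[Point]) -> float:
    """Mean distance from each marker to its nearest body-model vertex."""
    return sum(min(math.dist(m, v) for v in vertices) for m in markers) / len(markers)


def best_root_yaw(
    markers: Sequence[Point], vertices: Sequence[Point], num_hypotheses: int = 8
) -> Tuple[float, float]:
    """Test evenly spaced yaw hypotheses; return (best angle, its error)."""
    m0, v0 = centered(markers), centered(vertices)
    best_theta, best_err = 0.0, math.inf
    for k in range(num_hypotheses):
        theta = 2.0 * math.pi * k / num_hypotheses
        err = fitting_error([rotate_y(p, theta) for p in m0], v0)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err


# Toy check: the "markers" are the body turned by -90 degrees about y, then shifted.
body = [(0, 0, 0), (2, 0, 0), (0, 1, 0), (0, 0, 3)]
mocap = [tuple(a + b for a, b in zip(rotate_y(p, -math.pi / 2), (5, 0, 5))) for p in body]
theta, err = best_root_yaw(mocap, body)
print(round(math.degrees(theta)))  # 90
```

One plausible motivation for scoring a small set of discrete hypotheses, rather than running gradient descent from a single initial orientation, is that an uncalibrated camera leaves the body's facing direction in the mocap frame unknown, and a local optimizer started on the wrong side can settle in a poor minimum. The winning hypothesis would then seed continuous refinement.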

Paper Structure (46 sections, 8 equations, 12 figures, 6 tables)

Introduction
Related Work
  Statistical Human Models
  Motion Capture Solving
  Automatic Marker Labeling
  Monocular Video Mocap
Problem Definition and Methodology
  Problem Definition
  The Proposed UUO Mocap Method
    Monocular Reconstruction from Video
    Marker-Part Matching
      Step 1: marker segmentation
      Step 2: multiple hypothesis testing for part localization
    Mocap Solving
      Stage 1: multiple hypothesis testing for root rotation
...and 31 more sections

Figures (12)

Figure 1: The proposed pipeline of our UUO mocap solver consists of three modules (cf. details in the Methodology section). Our method takes as input a monocular video and UUO markers to jointly predict marker labels, pose, and body shape. First, we use an off-the-shelf method (HMR2.0 goel2023humans) to generate a human prior from the video. Next, we segment the 3D mocap markers to estimate the number of bones that need to be reconstructed. We then search for the best-fitting body part. Finally, we solve for the pose and body shape through a novel optimization process.
Figure 2: Our Marker-Part Matching first computes the standard deviation of the distances between every pair of markers across all frames, then uses these values to construct an affinity matrix for clustering markers into groups, and finally conducts hypothesis testing to select the best match, i.e., the one that produces the minimum fitting error w.r.t. the initial body model obtained from the monocular video.
Figure 3: Qualitative results for the validation split of the MOYO dataset tripathi20233d. This dataset is challenging because of its unique and difficult poses. Furthermore, markers are densely packed, which can make labeling ambiguous. SOMA struggles to label the markers accurately, resulting in a poor-quality reconstruction. Our method produces better visual results.
Figure 4: Qualitative results for the validation split of the UMPM dataset HICV11:UMPM. HMR2.0+RR exhibits alignment issues, and SOMA produces an incorrect joint position at the right knee. In contrast, our method produces better visual results.
Figure 5: Qualitative results for the validation split of the CMU Kitchen dataset de2009guide. Our approach aligns better to the markers than HMR2.0+RR, and produces a body shape and pose closer to the reference than SOMA.
...and 7 more figures
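The marker segmentation in Figure 2 rests on a rigidity cue: markers attached to the same bone keep nearly constant pairwise distances over time, so the standard deviation of their distance is small. A minimal sketch follows. It assumes a Gaussian affinity exp(-(s/σ)²) with a hypothetical σ of 1 cm (positions in meters) and, because the clustering algorithm is not specified in this summary, substitutes a simple stand-in: thresholding the affinity and taking connected components with union-find. The number of resulting groups plays the role of the estimated number of bones. All names and parameter values here are illustrative.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def distance_std_matrix(frames: Sequence[Sequence[Point]]) -> List[List[float]]:
    """S[i][j] = standard deviation over frames of the distance between markers i and j."""
    n = len(frames[0])
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dists = [math.dist(f[i], f[j]) for f in frames]
            S[i][j] = S[j][i] = statistics.pstdev(dists)
    return S


def segment_markers(
    frames: Sequence[Sequence[Point]], sigma: float = 0.01, threshold: float = 0.5
) -> List[int]:
    """Group markers that move rigidly together; returns one group id per marker."""
    S = distance_std_matrix(frames)
    n = len(S)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            affinity = math.exp(-((S[i][j] / sigma) ** 2))
            if affinity > threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive group ids 0, 1, 2, ...
    roots: dict = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy sequence: markers 0-1 on a static bone, markers 2-3 on a bone sliding along x.
frames = [[(0, 0, 0), (1, 0, 0), (t, 2, 0), (t, 3, 0)] for t in range(3)]
print(segment_markers(frames))  # [0, 0, 1, 1]
```

Connected components can chain two bones together through a single spuriously rigid marker pair (e.g., markers near a joint that barely moves); a spectral or hierarchical clustering of the same affinity matrix would typically be more robust.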
