3D Kinematics Estimation from Video with a Biomechanical Model and Synthetic Training Data

Zhi-Yi Lin, Bofan Lyu, Judith Cueto Fernandez, Eline van der Kruk, Ajay Seth, Xucong Zhang

TL;DR

This work proposes a novel biomechanics-aware network that directly outputs 3D kinematics from two input views, incorporating a biomechanical prior and spatio-temporal information. It outperforms previous state-of-the-art methods when evaluated across multiple datasets.

Abstract

Accurate 3D kinematics estimation of the human body is crucial in various applications for human health and mobility, such as rehabilitation, injury prevention, and diagnosis, as it helps to understand the biomechanical loading experienced during movement. Conventional marker-based motion capture is expensive in terms of financial investment, time, and the expertise required. Moreover, due to the scarcity of datasets with accurate annotations, existing markerless motion capture methods suffer from challenges including unreliable 2D keypoint detection, limited anatomical accuracy, and low generalization capability. In this work, we propose a novel biomechanics-aware network that directly outputs 3D kinematics from two input views, taking into account a biomechanical prior and spatio-temporal information. To train the model, we create a synthetic dataset, ODAH, with accurate kinematics annotations generated by aligning the body mesh from the SMPL-X model with a full-body OpenSim skeletal model. Our extensive experiments demonstrate that the proposed approach, trained only on synthetic data, outperforms previous state-of-the-art methods when evaluated across multiple datasets, revealing a promising direction for enhancing video-based human motion capture.

Paper Structure

This paper contains 22 sections, 8 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The proposed biomechanics-aware network consists of a frame feature encoder and a spatio-temporal feature refinement module, which together infer 3D kinematics from two-view real-world video inputs. To train the model, we create a synthetic RGB video dataset, ODAH, by combining the kinematic skeleton from the OpenSim model, the body mesh from the SMPL-X model, and motions from the AMASS dataset to provide accurate ground truth data. Notably, the end-to-end biomechanics-aware 3D kinematics estimation model is trained exclusively on this synthetic data. Examples of real person images are from the OpenCap dataset (Uhlrich et al., 2022), and faces were pixelated for privacy reasons.
  • Figure 2: Architecture of the frame feature encoder. Image features are extracted by a stacked hourglass network. The locations at which local image features are extracted are calculated by projecting the sampled 3D points onto the two views. Subsequently, point features are derived by concatenating the local image features and the 3D coordinates of the sampled 3D points. Finally, an MLP encodes all point features into one compact frame feature.
  • Figure 3: An overview of the proposed spatio-temporal feature refinement. Given the sequence of frame features produced by the frame feature encoder, the refinement architecture stacks them and processes the result as a 2D image. This yields a sequence of joint angles and a set of body segment scales, with global optimization across frames.
  • Figure 4: An overview of our synthetic data generation pipeline. We first register the OpenSim skeletal model to the SMPL-X mesh, then optimize the body shape and motion parameters of the mesh to fit the subject-specific OpenSim skeletal model and joint angles. Finally, we simulate real-world environments with scene and camera settings to render the synthetic video data.
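
The point-feature construction described in the Figure 2 caption (project sampled 3D points onto two views, read local image features, concatenate with the 3D coordinates, encode with an MLP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 3x4 pinhole camera matrices, the nearest-neighbour feature lookup, and the toy two-layer MLP are all assumptions made for illustration, and the feature maps stand in for the output of the stacked hourglass network.

```python
import numpy as np

def project_points(points_3d, P):
    """Project (N, 3) points with an assumed 3x4 pinhole camera matrix P to (N, 2) pixels."""
    homog = np.hstack([points_3d, np.ones((points_3d.shape[0], 1))])  # (N, 4)
    uvw = homog @ P.T                                                 # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]                                   # perspective divide

def sample_features(feature_map, uv):
    """Nearest-neighbour lookup in a (C, H, W) feature map at (N, 2) pixel locations.

    u indexes columns and v indexes rows; locations are clipped to the map bounds.
    The sampling scheme is an assumption, not taken from the paper.
    """
    _, H, W = feature_map.shape
    cols = np.clip(np.rint(uv[:, 0]).astype(int), 0, W - 1)
    rows = np.clip(np.rint(uv[:, 1]).astype(int), 0, H - 1)
    return feature_map[:, rows, cols].T                               # (N, C)

def point_features(points_3d, feature_maps, cameras):
    """Concatenate per-view local image features with the 3D coordinates of each point."""
    per_view = [
        sample_features(fmap, project_points(points_3d, P))
        for fmap, P in zip(feature_maps, cameras)
    ]
    return np.concatenate(per_view + [points_3d], axis=1)             # (N, sum(C) + 3)

def encode_frame(point_feats, W1, b1, W2, b2):
    """Toy two-layer MLP (hypothetical weights) mapping all point features to one frame feature."""
    hidden = np.maximum(point_feats.reshape(-1) @ W1 + b1, 0.0)       # ReLU
    return hidden @ W2 + b2
```

With two sampled points, two views, and 2-channel feature maps, `point_features` returns an array of shape `(2, 7)`: two channels per view plus the three coordinates. Flattening it before the MLP gives a single vector per frame, which matches the caption's "one compact frame feature" only in spirit; the real encoder is learned end to end.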