MAMMA: Markerless & Automatic Multi-Person Motion Action Capture

Hanz Cuevas-Velasquez, Anastasios Yiannakidis, Soyong Shin, Giorgio Becherini, Markus Höschle, Joachim Tesch, Taylor Obersat, Tsvetelina Alexiadis, Eni Halilaj, Michael J. Black

TL;DR

MAMMA is a markerless, multi-view pipeline that recovers SMPL-X parameters for two-person interactions by predicting dense 2D surface landmarks conditioned on segmentation masks. The method uses a transformer-based landmark network with per-landmark queries, followed by multi-stage SMPL-X fitting and a robust multi-view correspondence framework, to recover pose, shape, and contact without markers. The authors address data scarcity with MammaSyn, a large synthetic dataset, and validate on real multi-view benchmarks (MammaEval), reporting accuracy competitive with Vicon-based motion capture at a significantly lower setup time. They also plan to release the datasets, training code, and evaluation protocols to support future research in markerless, contact-aware motion capture.

Abstract

We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video of two-person interaction sequences. Traditional motion-capture systems rely on physical markers. Although they offer high accuracy, their requirements of specialized hardware, manual marker placement, and extensive post-processing make them costly and time-consuming. Recent learning-based methods attempt to overcome these limitations, but most are designed for single-person capture, rely on sparse keypoints, or struggle with occlusions and physical interactions. In this work, we introduce a method that predicts dense 2D contact-aware surface landmarks conditioned on segmentation masks, enabling person-specific correspondence estimation even under heavy occlusion. We employ a novel architecture that exploits learnable queries for each landmark. We demonstrate that our approach can handle complex person-person interaction and offers greater accuracy than existing methods. To train our network, we construct a large, synthetic multi-view dataset combining human motions from diverse sources, including extreme poses, hand motions, and close interactions. Our dataset yields high-variability synthetic sequences with rich body contact and occlusion, and includes SMPL-X ground-truth annotations with dense 2D landmarks. The result is a system capable of capturing human motion without the need for markers. Our approach offers competitive reconstruction quality compared to commercial marker-based motion-capture solutions, without the extensive manual cleanup. Finally, we address the absence of common benchmarks for dense-landmark prediction and markerless motion capture by introducing two evaluation settings built from real multi-view sequences. We will release our dataset, benchmark, method, training code, and pre-trained model weights for research purposes.

Paper Structure

This paper contains 30 sections, 2 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: From dense surface landmarks to SMPL-X: MAMMA accurately reconstructs pose and shape from synchronized multi-view videos.
  • Figure 2: Cropped samples from MammaSyn dataset.
  • Figure 3: 512 landmarks sampled from the SMPL-X body.
  • Figure 4: MammaNet. The network takes the image and mask as input. It predicts per-landmark visibility probability $p$ (green is visible, red is not visible), landmark locations $\mu$, uncertainties $\sigma$ (red means highly uncertain), and person-person contact $pc$ and floor contact $fl$ probabilities (red means no contact, green means contact).
  • Figure 5: Comparison on extreme poses. Ground-truth landmarks are shown in green. For each prediction, landmarks are color-coded: red indicates higher pixel error, green indicates lower pixel error. We compare networks trained on BEDLAM.
  • ...and 11 more figures
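
The Figure 4 caption lists per-landmark locations $\mu$, uncertainties $\sigma$, and visibility $p$ as network outputs. One common way to supervise such outputs is a Gaussian negative log-likelihood over visible landmarks; the sketch below illustrates that idea only. The page does not state the loss MAMMA uses, so the isotropic Gaussian form, the function name `landmark_nll`, and the array shapes are all assumptions made for illustration.

```python
import numpy as np

def landmark_nll(mu, sigma, target, visible):
    """Isotropic 2D Gaussian NLL averaged over visible landmarks.

    Hypothetical helper, not taken from the MAMMA code base.
    mu, target: (N, 2) predicted and ground-truth pixel coordinates
    sigma:      (N,)   predicted standard deviation per landmark (> 0)
    visible:    (N,)   boolean mask; hidden landmarks are ignored
    """
    sq_err = np.sum((mu - target) ** 2, axis=1)
    # -log N(target | mu, sigma^2 I) in 2D, dropping the constant log(2*pi)
    nll = 2.0 * np.log(sigma) + sq_err / (2.0 * sigma ** 2)
    mask = visible.astype(float)
    # Guard against division by zero when no landmark is visible
    return float(np.sum(nll * mask) / max(mask.sum(), 1.0))

# Three landmarks: one with a 2-pixel error, one exact, one hidden
mu = np.zeros((3, 2))
target = np.array([[2.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
sigma = np.ones(3)
visible = np.array([True, True, False])
loss = landmark_nll(mu, sigma, target, visible)
```

Under this formulation a larger predicted $\sigma$ shrinks the penalty on the squared error but pays a $\log \sigma$ cost, so the network is encouraged to report high uncertainty only where its location estimate is unreliable, for example under occlusion.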