CoMotion: Concurrent Multi-person 3D Motion
Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, Vladlen Koltun
TL;DR
CoMotion tackles online multi-person 3D pose tracking from monocular video by updating every tracked SMPL pose directly from each incoming frame, rather than matching frame-wise detections across time. It combines a ConvNeXtV2-based image encoder with a detection head and a recurrent pose-update module that uses cross-attention to refine all tracks in parallel, including people who are partially visible or occluded. Trained with a three-stage curriculum on a large, heterogeneous mix of image and video datasets with pseudo-labeled annotations (including NLF pseudo-labels), CoMotion matches state-of-the-art 3D pose estimation accuracy while tracking multiple people faster and more accurately on benchmarks such as PoseTrack21. The method shows strong temporal coherence and robustness to occlusion; remaining areas for improvement include long-range track stability and richer training data to reduce identity switches.
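To make the cross-attention step in the pose-update module concrete, here is a minimal pure-Python sketch in which each track state acts as a query over a set of image tokens. It is an illustration under stated assumptions, not the released implementation: the function name `cross_attention_update` is hypothetical, the learned query/key/value projections are omitted (identity projections), a single head is used, and the residual update is a common design choice rather than a confirmed detail of CoMotion.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_update(track_states, image_tokens):
    """Single-head cross-attention with identity projections (assumption).

    track_states: list of d-dimensional track vectors (the queries)
    image_tokens: list of d-dimensional image feature vectors (keys and values)
    Returns the residually updated track states and the attention weights.
    """
    d = len(track_states[0])
    updated, weights = [], []
    for q in track_states:
        # Scaled dot-product score between this track and every image token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in image_tokens]
        w = softmax(scores)
        # Attention-weighted sum of the image tokens.
        context = [sum(wi * tok[j] for wi, tok in zip(w, image_tokens)) for j in range(d)]
        # Residual update: every track is refined independently of the others,
        # so the loop could run in parallel.
        updated.append([qj + cj for qj, cj in zip(q, context)])
        weights.append(w)
    return updated, weights


# Toy data: two tracks and three image tokens in a 2-D feature space.
tracks = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
updated, weights = cross_attention_update(tracks, tokens)
print(len(updated), len(weights[0]))  # 2 3
```

Each track attends most strongly to the token aligned with its own state, which is the mechanism that lets every track pull its own evidence from a shared image encoding in one pass.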
Abstract
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion
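The control flow described in the abstract, updating existing tracks directly from the new frame and using detection only to start new tracks, can be sketched as follows. This is a toy illustration under loud assumptions: poses are single floats standing in for SMPL parameters, a frame is a list of observed positions, and `track_step`, `toy_update`, `toy_detect`, and `toy_same_person` are hypothetical names. In particular, the nearest-observation rule in `toy_update` is only a placeholder for the learned pose update, and the rule for spawning new tracks is an assumption rather than the paper's procedure.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    pose: float  # toy stand-in for a person's SMPL pose parameters


def track_step(tracks, frame, update_fn, detect_fn, same_person, ids):
    """Process one frame: update every existing track, then add new people."""
    # 1. Update all existing tracks directly from the new frame.
    for track in tracks:
        track.pose = update_fn(track.pose, frame)
    # 2. Detections not already explained by a track start new tracks.
    for det in detect_fn(frame):
        if not any(same_person(t.pose, det) for t in tracks):
            tracks.append(Track(next(ids), det))
    return tracks


def toy_update(pose, frame, radius=2.0):
    """Placeholder update: follow nearby evidence, else keep the last pose."""
    nearby = [x for x in frame if abs(x - pose) <= radius]
    return min(nearby, key=lambda x: abs(x - pose)) if nearby else pose


def toy_detect(frame):
    """Placeholder detector: every observation is a detection."""
    return list(frame)


def toy_same_person(pose, det):
    """Placeholder overlap test between a track and a detection."""
    return abs(pose - det) < 1.0


# The second person is not observed in the third frame (occlusion).
frames = [[0.0, 10.0], [0.5, 10.5], [1.0], [1.5, 11.5]]
tracks, ids = [], count()
for frame in frames:
    track_step(tracks, frame, toy_update, toy_detect, toy_same_person, ids)
print([(t.track_id, t.pose) for t in tracks])  # [(0, 1.5), (1, 11.5)]
```

Because the second track is carried through the frame in which its person is unobserved, it keeps identity 1 when the person reappears, with no detection-to-detection matching across frames.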
