CoMotion: Concurrent Multi-person 3D Motion

Alejandro Newell, Peiyun Hu, Lahav Lipson, Stephan R. Richter, Vladlen Koltun

TL;DR

CoMotion tackles online multi-person 3D pose tracking from monocular video by updating all tracked SMPL poses directly from each incoming frame, rather than matching per-frame detections across time. It combines a ConvNeXtV2-based encoder with a detection head and a recurrent pose-update module that uses cross-attention to refine all tracks in parallel, including people who are partially visible or occluded. Trained through a three-stage curriculum on a large, heterogeneous mix of pseudo-labeled datasets (enhanced by NLF pseudo-labels), CoMotion matches state-of-the-art systems in 3D pose estimation accuracy while tracking multiple people faster and more accurately on benchmarks such as PoseTrack21. The method shows strong temporal coherence and robustness to occlusion; remaining areas for improvement include long-range track stability and richer data to reduce identity switches.

Abstract

We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at https://github.com/apple/ml-comotion

Paper Structure

This paper contains 19 sections, 4 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: CoMotion tracks 3D poses online from monocular RGB video. Rather than detect new poses in each frame and associate them to existing tracks, CoMotion updates tracks directly from incoming image features. As a result, CoMotion keeps track of distinct individuals as they overlap in the camera frame (top) and occlude each other (bottom). Arrows highlight some points of interest.
  • Figure 2: Overview. CoMotion estimates 3D poses for all people in a frame. An image encoder produces image features $F^t$, which are passed through the detection module to identify potential new tracks. In parallel, the pose update module attends to $F^t$ to update the existing tracks from the previous timestep. Both outputs are compared to each other to decide whether to instantiate or remove any tracks. If a detection is flagged as a new track, it is passed through the update module before being added to the final output tracks for the current frame. The inset details the pose update module.
  • Figure 3: We compare predictions made by CoMotion and 4D Humans unrolled through time on a sample from PoseTrack. Because 4D Humans makes independent predictions per frame, it occasionally changes the estimated pose abruptly (see green track on the right).
  • Figure 4: Incorrect handling of missing annotations in PoseTrack18. Due to incomplete annotations in PoseTrack18, predicted tracks may be incorrectly regarded as "false positives". We show representative samples where annotations are green and "false positives" are red.
  • Figure 5: Incorrect handling of missing annotations in PoseTrack21. PoseTrack21 addresses the incompleteness of PoseTrack18 annotations by providing 'ignore' regions to accompany the annotated tracks. For the frame on the left, the center image illustrates the annotation of the person in the center (shown in green) and a polygon defining the 'ignore' region in blue. The right image shows predicted tracks in red, which are still penalized as false positives by the PoseTrack21 evaluation code despite being contained in the 'ignore' region. This is a bug that we fix.
  • ...and 7 more figures
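
The per-frame control flow described in the Figure 2 caption can be sketched roughly as below. This is a toy illustration and not CoMotion's implementation: poses are 2D points rather than SMPL parameters, and `encode`, `detect`, `update`, `step`, and the three thresholds are hypothetical placeholders chosen only to make the instantiate/update/remove logic concrete.

```python
from dataclasses import dataclass
from math import dist

# All constants below are hypothetical and exist only for this sketch.
MATCH_RADIUS = 1.0  # a detection this close to a track is treated as already tracked
MIN_SCORE = 0.3     # tracks whose confidence falls below this are removed
DECAY = 0.5         # confidence decay when no image evidence supports a track


@dataclass
class Track:
    track_id: int
    pose: tuple[float, float]  # toy stand-in for a full SMPL pose
    score: float = 1.0


def encode(frame):
    """Toy image encoder: the 'features' F^t are just the observed 2D points."""
    return list(frame)


def detect(features):
    """Toy detection module: every feature point is a candidate new track."""
    return list(features)


def update(track, features):
    """Toy pose update: snap to the nearest nearby evidence, else keep the pose and decay."""
    near = [p for p in features if dist(p, track.pose) <= MATCH_RADIUS]
    if near:
        pose = min(near, key=lambda p: dist(p, track.pose))
        return Track(track.track_id, pose, 1.0)
    return Track(track.track_id, track.pose, track.score * DECAY)


def step(tracks, frame, next_id):
    """Process one frame and return (tracks for this frame, next unused track id)."""
    features = encode(frame)
    detections = detect(features)                    # potential new tracks
    updated = [update(t, features) for t in tracks]  # existing tracks read the features directly
    # Compare both outputs: remove weak tracks, instantiate unexplained detections.
    kept = [t for t in updated if t.score >= MIN_SCORE]
    for d in detections:
        if all(dist(d, t.pose) > MATCH_RADIUS for t in kept):
            # A new track also passes through the update step before it is output.
            kept.append(update(Track(next_id, d), features))
            next_id += 1
    return kept, next_id
```

In this sketch, a track with no supporting evidence keeps its previous pose for a frame with reduced confidence instead of being dropped immediately, which loosely mirrors the idea of carrying tracks through occlusion without matching detections across time.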