WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation

Tianjian Jiang, Johsan Billingham, Sebastian Müksch, Juan Zarate, Nicolas Evans, Martin R. Oswald, Marc Pollefeys, Otmar Hilliges, Manuel Kaufmann, Jie Song

TL;DR

WorldPose is presented as the first large-scale, in-the-wild multi-person 3D pose dataset with global trajectories. It was captured during the 2022 FIFA World Cup using fixed stadium cameras and a moving broadcasting camera. A three-stage pipeline (static camera calibration, 3D pose estimation with SMPL fitting, and broadcasting-camera calibration) recovers global poses with around 8 cm average joint error against a Vicon reference on a controlled subset. The dataset contains about 2.5 million 3D SMPL poses across 150k frames and over 120 km of total player travel, which supports benchmarking of global-pose methods and sports analytics tasks. Experiments show that current state-of-the-art global-pose approaches struggle with large outdoor spaces and inter-player pose coordination, which makes WorldPose a challenging benchmark and a resource for multi-person pose estimation and motion analysis.
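To make the "average joint error" figure concrete, the following is a minimal sketch of a mean per-joint Euclidean error between an estimated pose and a reference pose. The function name `mean_joint_error` and the toy coordinates are illustrative assumptions, not the authors' evaluation code, and the exact protocol used against the Vicon reference (joint set, alignment) is not specified here.

```python
import math


def mean_joint_error(pred, ref):
    """Mean Euclidean distance between corresponding 3D joints.

    The result is in the same unit as the inputs (metres below).
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(pred)


# Hypothetical toy data in metres: three joints of one player.
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.0, 1.5)]
estimate = [(0.06, 0.0, 0.0), (0.0, 0.08, 1.0), (0.5, 0.0, 1.6)]

error_m = mean_joint_error(estimate, reference)
print(f"mean joint error: {error_m * 100:.1f} cm")
```

In this toy case the per-joint offsets are 6 cm, 8 cm, and 10 cm, so the mean is 8 cm. No alignment is applied before the distances are taken, so the error also reflects where the pose sits in world coordinates, which is the quantity a global-pose benchmark cares about.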

Abstract

We present WorldPose, a novel dataset for advancing research in multi-person global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or in constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and moving cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres. We then leverage the captured players' motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises more than 80 sequences with approximately 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters, and footage will be released for academic research purposes.

Paper Structure

This paper contains 45 sections, 15 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: We leverage multi-view cameras to curate WorldPose, a comprehensive dataset designed for multi-person 3D human pose estimation with global trajectories.
  • Figure 2: Sample images of the dataset. The first row displays the camera view and overlay, and the second row presents a novel 3D view to help readers understand the 3D locations of the subjects.
  • Figure 3: Method overview (from left to right): We take as input 16-18 high-resolution videos from statically placed cameras inside the stadium. The static cameras are calibrated by using hand-picked 2D points and photometric information (\ref{sec:static_cam_calibration}). This yields camera calibrations $\boldsymbol{\Lambda}_c$ for every camera $c$, which we then use to triangulate and track 3D poses of each player (\ref{sec:pose_estimation}). We fit SMPL to the 3D pose data obtaining parameters $\boldsymbol{\Omega}$. Finally, we calibrate a moving broadcasting camera to align the estimated 3D poses with broadcasted TV footage (\ref{sec:broadcast}). The method outputs 3D SMPL pose and shape parameters $\boldsymbol{\Omega}$ of all soccer players, including their global trajectory, and accurate calibrations of the broadcast cameras $\boldsymbol{\Lambda}_b$ with high-quality player pose reprojections. Stadium image sourced from FreePik.
  • Figure 4: Vicon setup at night with 6 subjects playing in the penalty box. This data is used for evaluation purposes.
  • Figure 5: Visualization of broadcasting camera calibration before (left) and after (right) refinement with 3D poses (\ref{sec:broadcast}). Note the improved reprojections in the zoom-ins.
  • ...and 11 more figures
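
The Figure 3 caption mentions using the static camera calibrations to triangulate 3D poses. As a rough illustration of multi-view triangulation in general, the sketch below finds the 3D point closest, in the least-squares sense, to a set of viewing rays. It is a generic ray-intersection method and not necessarily the authors' formulation. It assumes each camera's centre and the ray direction toward the observed joint are already known (in practice these would be derived from the calibrations and 2D keypoints), and all names and camera positions are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; point is not determined")
    x = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        x.append(det3(m) / d)
    return x


def triangulate(origins, directions):
    """Point minimising the summed squared distance to all rays.

    Solves sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i,
    where o_i is a camera centre and d_i a unit ray direction.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        d = normalize(d)
        for r in range(3):
            for c in range(3):
                proj = (1.0 if r == c else 0.0) - d[r] * d[c]
                a[r][c] += proj
                b[r] += proj * o[c]
    return solve3(a, b)


# Hypothetical camera centres (metres) around a pitch, all observing one joint.
target = (10.0, 20.0, 1.5)
cams = [(0.0, 0.0, 15.0), (105.0, 0.0, 15.0), (50.0, 68.0, 20.0)]
rays = [[t - c for t, c in zip(target, cam)] for cam in cams]

point = triangulate(cams, rays)
print([round(c, 3) for c in point])
```

With noise-free rays the recovered point coincides with the target. With noisy 2D detections the rays no longer meet exactly, and the solution is the point that is closest to all of them on average, which is why having many views helps.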