SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding

Marc Gutiérrez-Pérez, Antonio Agudo

TL;DR

This work introduces SoccerNet-v3D and ISSIA-3D, two publicly available datasets designed for 3D scene understanding in soccer broadcasts by leveraging field-line-based camera calibration and multi-view synchronization to enable 3D ball localization through triangulation. It proposes a monocular 3D ball localization task grounded in multi-view triangulation, along with calibration and reprojection metrics to assess annotation quality, and a bounding-box optimization method to ensure consistency with the 3D scene. The datasets are built by aligning synchronized main-camera and replay frames (SoccerNet-v3D) and six static synchronized cameras (ISSIA-3D), with robust calibration via PnLCalib and 3D localization validated against ground-truth ball positions. Experimental results show that optimized bounding boxes improve 2D detection and 3D localization accuracy, establishing new benchmarks for monocular 3D ball localization and enabling further research in 3D soccer scene understanding and tracking.

Abstract

Sports video analysis is a key domain in computer vision, enabling detailed spatial understanding through multi-view correspondences. In this work, we introduce SoccerNet-v3D and ISSIA-3D, two enhanced and scalable datasets designed for 3D scene understanding in soccer broadcast analysis. These datasets extend SoccerNet-v3 and ISSIA by incorporating field-line-based camera calibration and multi-view synchronization, enabling 3D object localization through triangulation. We propose a monocular 3D ball localization task built upon the triangulation of ground-truth 2D ball annotations, along with several calibration and reprojection metrics to assess annotation quality on demand. Additionally, we present a single-image 3D ball localization method as a baseline, leveraging camera calibration and ball size priors to estimate the ball's position from a monocular viewpoint. To further refine 2D annotations, we introduce a bounding box optimization technique that ensures alignment with the 3D scene representation. Our proposed datasets establish new benchmarks for 3D soccer scene understanding, enhancing both spatial and temporal analysis in sports analytics. Finally, we provide code to facilitate access to our annotations and the generation pipelines for the datasets.
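The single-image baseline mentioned above combines camera calibration with a ball size prior. The following is a minimal sketch of one way such an estimate can be computed, not the authors' implementation: it assumes a pinhole camera with the convention $\mathbf{x}_{cam} = \mathbf{R}\mathbf{x}_{world} + \mathbf{t}$, a ball diameter of roughly 0.22 m, and a tight axis-aligned bounding box. The function name `localize_ball_monocular` and the constant `BALL_DIAMETER_M` are hypothetical.

```python
import numpy as np

# Assumed prior (not taken from the paper): a regulation ball is about 22 cm across.
BALL_DIAMETER_M = 0.22


def localize_ball_monocular(K, R, t, bbox, ball_diameter=BALL_DIAMETER_M):
    """Estimate the 3D ball centre in world coordinates from one bounding box.

    K: 3x3 intrinsics; R, t: extrinsics with x_cam = R @ x_world + t.
    bbox: (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = bbox
    # Box centre approximates the projected ball centre.
    u, v = 0.5 * (x_min + x_max), 0.5 * (y_min + y_max)
    # Apparent diameter in pixels, averaged over width and height.
    d_px = 0.5 * ((x_max - x_min) + (y_max - y_min))
    # Pinhole similar triangles: depth = focal_length * real_size / pixel_size.
    f = 0.5 * (K[0, 0] + K[1, 1])
    depth = f * ball_diameter / d_px
    # Back-project the centre pixel to a ray with unit depth, then scale it.
    ray = np.linalg.solve(K, np.array([u, v, 1.0]))
    x_cam = depth * ray / ray[2]
    # Move from camera coordinates to world coordinates.
    return R.T @ (x_cam - t)


# Usage with made-up calibration values.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
ball_xyz = localize_ball_monocular(K, R, t, (687.25, 382.25, 692.75, 387.75))
```

Because depth is inversely proportional to the apparent diameter, a one-pixel error on a ball only a few pixels wide shifts the estimate by several metres, which is one reason bounding boxes consistent with the 3D scene matter for this task.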

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Ball triangulation. Given two cameras with different viewpoints, a 3D point $\mathbf{p}_{12}$ can be estimated from the corresponding image points $\mathbf{\bar{p}}_1$ and $\mathbf{\bar{p}}_2$ in cameras 1 and 2, respectively, using the camera projection matrices $\mathbf{P}_1$ and $\mathbf{P}_2$. However, since $\mathbf{p}_{12}$ is only the optimal estimate rather than an exact intersection of the two rays, its reprojected image points, $\mathbf{\bar{p}}'_1$ and $\mathbf{\bar{p}}'_2$, do not exactly match the original points $\mathbf{\bar{p}}_1$ and $\mathbf{\bar{p}}_2$. The reprojection errors for cameras 1 and 2 are $e_{12}$ and $e_{21}$, respectively, and $\beta$ is the parallax angle defined by the intersection of rays $\mathbf{d}_1$ and $\mathbf{d}_2$.
  • Figure 2: SoccerNet-v3D dataset generation pipeline. A main camera frame is paired with its corresponding synchronized replay frames, where blue dots indicate the original SoccerNet-v3 \cite{cioppa2022scaling} field-line annotations. The PnLCalib \cite{gutierrez4998149pnlcalib} calibration pipeline is used to recover the camera parameters $\{\mathbf{K},\mathbf{R},\mathbf{t}\}$. Calibration quality is assessed using $\text{JaC}_\gamma$, with a threshold of $\text{JaC}_{0.5\%}=0.75$ to determine whether frames qualify as part of the multi-view system. Red lines represent the field projection obtained from the estimated calibration. Finally, 2D ball annotations are fused through triangulation to estimate 3D ball positions, and the original bounding boxes are optimized to ensure consistency with the 3D scene. The original SoccerNet-v3 \cite{cioppa2022scaling} and optimized bounding boxes are shown in red and blue, respectively.
  • Figure 3: Number of available multi-view systems with ball bounding box annotations in the SoccerNet-v3 \cite{cioppa2022scaling} dataset as a function of the $\text{JaC}_\gamma$ threshold, for $\gamma \in \{0.5, 1, 2\}\%$. The red dashed line marks the selected threshold, $\text{JaC}_{0.5\%} > 0.75$.
  • Figure 4: A visual comparison of the precise manually annotated ball bounding boxes (blue), the original SoccerNet-v3 \cite{cioppa2022scaling} ball bounding boxes (red), and the optimized bounding boxes (orange) obtained through the optimization pipeline described in \ref{sec:bboxopt}.
  • Figure 5: Precision-threshold curves for 3D ball localization in SoccerNet-v3D (left) and ISSIA-3D (right). The red dashed line marks the fixed threshold $\tau_{3D}$ used to compare 3D localization results.
  • ...and 1 more figure
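
The two-view geometry in the Figure 1 caption can be sketched as follows. This is an illustrative example under stated assumptions rather than the paper's pipeline: it uses the standard linear (DLT) triangulation, which may differ from the solver the authors use, and the helper names `triangulate_dlt`, `project`, `camera_center`, and `parallax_angle_deg` are hypothetical. It estimates $\mathbf{p}_{12}$ from $\mathbf{P}_1$, $\mathbf{P}_2$ and the two image points, then computes the per-camera reprojection errors and the parallax angle $\beta$.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations."""
    # Each observation contributes two linear constraints on the homogeneous point.
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector of the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point to pixel coordinates with a 3x4 projection matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def camera_center(P):
    """Camera centre C = -M^{-1} p4 for P = [M | p4]."""
    return -np.linalg.solve(P[:, :3], P[:, 3])


def parallax_angle_deg(P1, P2, X):
    """Angle between the two viewing rays that meet at X, in degrees."""
    d1 = X - camera_center(P1)
    d2 = X - camera_center(P2)
    cos_b = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.degrees(np.arccos(np.clip(cos_b, -1.0, 1.0)))


# Usage with two made-up cameras placed 10 m apart, both looking along +z.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-10.0], [0.0], [0.0]])])
X_true = np.array([5.0, 1.0, 20.0])

# Perturb the observations by one pixel to mimic annotation noise.
x1 = project(P1, X_true) + np.array([1.0, 0.0])
x2 = project(P2, X_true) - np.array([0.0, 1.0])

p12 = triangulate_dlt(P1, P2, x1, x2)
e12 = np.linalg.norm(x1 - project(P1, p12))  # reprojection error in camera 1
e21 = np.linalg.norm(x2 - project(P2, p12))  # reprojection error in camera 2
beta = parallax_angle_deg(P1, P2, p12)
```

With noise-free observations the two rays intersect and both reprojection errors vanish; with noisy annotations they generally do not, which is why such errors are usable as an annotation-quality signal.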