SoccerNet-Caption: Dense Video Captioning for Soccer Broadcasts Commentaries

Hassan Mkhallati, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

TL;DR

This work introduces the problem of single-anchored dense video captioning (SDVC) for soccer broadcasts and releases SoccerNet-Caption, a large-scale dataset with 36,894 timestamped commentaries across 715.9 hours of play from 471 games. It defines the SDVC task and presents a two-stage baseline (spotting then captioning) to generate anchored, context-rich comments, along with an evaluation benchmark using metrics adapted for temporal and linguistic accuracy. The results demonstrate the feasibility of generating meaningful anchored commentaries while highlighting challenges in precise temporal anchoring and long-context captioning, pointing to significant potential for improving fan engagement and accessibility of soccer content. The dataset and benchmark lay groundwork for future improvements in video-language understanding in sports, with practical implications for broadcasters and audience reach.

Abstract

Soccer is more than just a game - it is a passion that transcends borders and unites people worldwide. From the roar of the crowds to the excitement of the commentators, every moment of a soccer match is a thrill. Yet, with so many games happening simultaneously, fans cannot watch them all live. Notifications for main actions can help, but lack the engagement of live commentary, leaving fans feeling disconnected. To fulfill this need, we propose in this paper a novel task of dense video captioning focusing on the generation of textual commentaries anchored with single timestamps. To support this task, we present a challenging dataset consisting of almost 37k timestamped commentaries across 715.9 hours of soccer broadcast videos. We also propose a first benchmark and baseline for this task, highlighting the difficulty of temporally anchoring commentaries yet showing the capacity to generate meaningful commentaries. By providing broadcasters with a tool to summarize the content of their video with the same level of engagement as a live game, our method could help satisfy the needs of the numerous fans who follow their team but cannot necessarily watch the live game. We believe our method has the potential to enhance the accessibility and understanding of soccer content for a wider audience, bringing the excitement of the game to more people.

Paper Structure (9 sections, 2 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: SoccerNet-Caption. We provide a large-scale dataset for Single-anchored Dense Video Captioning (SDVC) in untrimmed soccer broadcast videos. Our SoccerNet-Caption dataset is composed of $36{,}894$ textual commentaries, temporally anchored within $715.9$ hours of soccer broadcasts. The comments describe the events occurring in the soccer game with rich factual, emotional, and sensational content.
  • Figure 2: Comment anonymization. We provide three versions for each comment. The original commentary, an identified version where each player is associated with a unique id token, and an anonymized version where each entity is replaced by a specific token: [TEAM], [COACH], [REFEREE], and [PLAYER].
  • Figure 3: Distribution of the comments. Most comments are uniformly scattered in each half-time, except at the start of the game where a peak is followed by fewer comments for $10$ minutes.
  • Figure 4: Distribution of the number of words per comment. The number of words per comment follows a long-tailed distribution, with $21.38$ words on average.
  • Figure 5: Distribution of the most common words. The most frequent words are the names of the teams and the players, followed by words semantically related to soccer verbs and soccer elements. There is a high imbalance in the distribution.
  • ...and 2 more figures
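
The three comment versions described in Figure 2 (original, identified, anonymized) can be illustrated with a minimal sketch. This is not the authors' pipeline: the entity table, the example names, the helper functions `anonymize` and `identify`, and the `[ROLE_id]` token format are all hypothetical, chosen only to show the idea of replacing each entity with a role token such as [TEAM], [COACH], [REFEREE], or [PLAYER].

```python
import re

# Hypothetical entity table for one game. The names and roles are invented
# for illustration and do not come from the SoccerNet-Caption dataset.
ENTITIES = {
    "Alice Example": "PLAYER",
    "Bob Sample": "PLAYER",
    "Example FC": "TEAM",
    "Carol Demo": "COACH",
    "Dan Placeholder": "REFEREE",
}


def _build_pattern(entities):
    # Try longer names first so a long name wins over a shorter name it contains.
    names = sorted(entities, key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in names))


def anonymize(comment, entities):
    """Replace every known entity with its role token, e.g. [PLAYER]."""
    if not entities:
        return comment
    pattern = _build_pattern(entities)
    return pattern.sub(lambda m: f"[{entities[m.group(0)]}]", comment)


def identify(comment, entities):
    """Replace every known entity with a role token carrying a unique id.

    The [ROLE_id] format is an assumption made for this sketch.
    """
    if not entities:
        return comment
    # Assign ids in alphabetical order of the entity names (arbitrary choice).
    ids = {name: index for index, name in enumerate(sorted(entities))}
    pattern = _build_pattern(entities)
    return pattern.sub(
        lambda m: f"[{entities[m.group(0)]}_{ids[m.group(0)]}]", comment
    )


if __name__ == "__main__":
    original = "Alice Example (Example FC) is booked by Dan Placeholder."
    print(original)
    print(identify(original, ENTITIES))
    print(anonymize(original, ENTITIES))
```

The anonymized form drops identity entirely, while the identified form keeps enough information to tell two players apart within a game. Exact string matching is used here for simplicity; real commentaries would likely need handling for partial names and spelling variants.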