TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You, Wenxuan Huang, Xinni Xie, Xiangyi Wei, Bangyan Li, Shaohui Lin, Yang Li, Changbo Wang

TL;DR

TimeSoccer is an end-to-end model for Single-anchor Dense Video Captioning (SDVC) on full-length 45-minute soccer videos. It introduces MoFA-Select, a motion-aware frame compression module, along with a progressive training regime and position-embedding extrapolation to support long-context understanding. The model jointly predicts event timestamps and captions in a single pass, which enables global temporal modeling and stronger semantic grounding. It reports state-of-the-art results on SoccerNet-Caption, with improvements in both localization accuracy and commentary quality, and has potential applications in real-time analytics and broadcast augmentation.

Abstract

Soccer is a globally popular sport, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, but soccer commentary generation often requires precise temporal localization and semantically rich descriptions over long-form video. Existing soccer MLLMs often rely on temporal priors for caption generation, so they cannot process soccer video end-to-end. Some traditional approaches instead follow a two-step paradigm, which is complex and fails to capture global context, resulting in suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves state-of-the-art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An illustration highlighting the global contextual understanding of TimeSoccer compared to SN-Caption (Mkhallati et al., 2023). Left panel: TimeSoccer correctly attributes a player substitution to a prior injury event. Right panel: TimeSoccer identifies a corner kick that occurred earlier in the match and appropriately continues the commentary with “[TEAM] can continue in their attacking effort.” Green (red) text indicates accurate (inaccurate) model-generated timestamps or descriptions.
  • Figure 2: Overview of TimeSoccer. Given a full 45-minute soccer video, frame features are extracted using an Image Encoder and Image Q-Former, while timestamps are obtained from the original frame sequence. Both features and timestamps are then processed by the MoFA-Select module, which (a) applies time-constrained K-Means clustering to segment frames, (b) computes motion-aware scores to allocate frame budgets $\tilde{R}_k$, and (c) merges redundant frames within each segment. The compressed features are passed through a sliding Video Q-Former to generate video tokens, which are concatenated with timestamp-based text tokens and the user query token before being input into the LLM for final prediction.
  • Figure 3: Qualitative comparison of different methods. TimeSoccer shows advantages from multiple perspectives: (i) more accurate timestamp alignment; (ii) improved event descriptions; (iii) richer, more realistic commentary resembling professional broadcasts. Black text denotes outputs that are reasonably close to the ground truth in either timing or semantics, green highlights semantically accurate descriptions, and red marks incorrect or irrelevant content.
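
The three MoFA-Select steps named in the Figure 2 caption (time-constrained clustering, motion-aware budget allocation, in-segment merging) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `time_weight` parameter, the use of consecutive-frame feature distance as the motion score, and merging by averaging adjacent frames are all assumptions made for illustration.

```python
import numpy as np

def time_constrained_kmeans(feats, times, k, time_weight=1.0, iters=10):
    """Lloyd's K-Means on features augmented with a scaled timestamp (assumed form)."""
    t = (times - times.min()) / max(times.max() - times.min(), 1e-8)
    x = np.concatenate([feats, time_weight * t[:, None]], axis=1)
    # Deterministic init: centroids at evenly spaced frames.
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels

def motion_scores(feats, labels, k):
    """Mean feature change between consecutive frames in each cluster (assumed score)."""
    scores = np.zeros(k)
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) > 1:
            scores[j] = np.linalg.norm(np.diff(feats[idx], axis=0), axis=1).mean()
    return scores

def allocate_budgets(scores, sizes, total_budget):
    """Split the frame budget in proportion to motion, capped by cluster size."""
    nonempty = sizes > 0
    budgets = nonempty.astype(int)  # at least one frame per non-empty cluster
    remaining = max(total_budget - budgets.sum(), 0)
    weights = np.where(nonempty, scores + 1e-8, 0.0)
    extra = np.floor(remaining * weights / weights.sum()).astype(int)
    return np.minimum(budgets + extra, sizes)

def merge_cluster(feats, times, budget):
    """Greedily average the most similar adjacent pair until `budget` frames remain."""
    feats, times = list(feats), list(times)
    while len(feats) > budget:
        gaps = [np.linalg.norm(feats[i + 1] - feats[i]) for i in range(len(feats) - 1)]
        i = int(np.argmin(gaps))
        feats[i] = (feats[i] + feats[i + 1]) / 2
        times[i] = (times[i] + times[i + 1]) / 2
        del feats[i + 1], times[i + 1]
    return feats, times

def mofa_select_sketch(feats, times, k=4, total_budget=16):
    """Cluster, allocate per-cluster budgets, merge, and return frames in time order."""
    labels = time_constrained_kmeans(feats, times, k)
    sizes = np.bincount(labels, minlength=k)
    budgets = allocate_budgets(motion_scores(feats, labels, k), sizes, total_budget)
    out_f, out_t = [], []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx):
            f, t = merge_cluster(feats[idx], times[idx], budgets[j])
            out_f += f
            out_t += t
    order = np.argsort(out_t)
    return np.stack(out_f)[order], np.array(out_t)[order]
```

Calling `mofa_select_sketch(feats, times, k=4, total_budget=16)` on an `(N, D)` feature array and a length-`N` timestamp array returns at most 16 compressed frame features with their averaged timestamps in temporal order. Clusters with more frame-to-frame change keep more frames, which is the intuition behind the per-segment budgets $\tilde{R}_k$.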