TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Ling You; Wenxuan Huang; Xinni Xie; Xiangyi Wei; Bangyan Li; Shaohui Lin; Yang Li; Changbo Wang

TL;DR

TimeSoccer is an end-to-end model for Single-anchor Dense Video Captioning (SDVC) on full-length 45-minute soccer matches. It introduces MoFA-Select, a motion-aware frame compression module, together with a progressive training regime and position-embedding extrapolation to support long-context understanding. The model jointly predicts event timestamps and captions in a single pass, which enables global temporal modeling and stronger semantic grounding. It achieves state-of-the-art results on SoccerNet-Caption tasks, improving both localization accuracy and commentary quality. Potential applications include real-time analytics and broadcast augmentation.

Abstract

Soccer is a globally popular sporting event, typically characterized by long matches and distinctive highlight moments. Recent advances in Multimodal Large Language Models (MLLMs) offer promising capabilities in temporal grounding and video understanding, yet soccer commentary generation requires both precise temporal localization and semantically rich descriptions over long-form video. Existing soccer MLLMs often rely on temporal priors for caption generation, so they cannot process soccer video end-to-end. Some traditional approaches instead follow a two-step paradigm that is complex and fails to capture global context, which leads to suboptimal performance. To address these issues, we present TimeSoccer, the first end-to-end soccer MLLM for Single-anchor Dense Video Captioning (SDVC) in full-match soccer videos. TimeSoccer jointly predicts timestamps and generates captions in a single pass, enabling global context modeling across 45-minute matches. To support long video understanding of soccer matches, we introduce MoFA-Select, a training-free, motion-aware frame compression module that adaptively selects representative frames via a coarse-to-fine strategy, and we incorporate complementary training paradigms to strengthen the model's ability to handle long temporal sequences. Extensive experiments demonstrate that TimeSoccer achieves State-of-The-Art (SoTA) performance on the SDVC task in an end-to-end form, generating high-quality commentary with accurate temporal alignment and strong semantic relevance.
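
To make the idea of training-free, motion-aware, coarse-to-fine frame selection more concrete, the following is a minimal illustrative sketch. It is not the MoFA-Select algorithm, whose details the abstract does not give. It assumes each frame is already reduced to a small feature vector, uses the mean absolute difference between consecutive frames as a stand-in motion score, and treats "coarse" as one pick per equal-length segment and "fine" as a global top-up. The names `motion_score`, `select_frames`, `budget`, and `num_segments` are all hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical per-frame feature vector


def motion_score(prev: Frame, cur: Frame) -> float:
    """Mean absolute difference between two consecutive frame features."""
    return sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)


def select_frames(frames: List[Frame], budget: int, num_segments: int) -> List[int]:
    """Return sorted indices of at most `budget` frames (illustrative only).

    Coarse stage: split the video into equal segments and keep the
    highest-motion frame of each, so every part of the match is covered.
    Fine stage: spend the remaining budget on the highest-motion frames
    left anywhere in the video.
    """
    n = len(frames)
    if n == 0 or budget <= 0:
        return []

    # The first frame has no predecessor, so its motion score is set to 0.
    scores = [0.0] + [motion_score(frames[i - 1], frames[i]) for i in range(1, n)]

    # Never use more segments than the budget or the number of frames allows.
    num_segments = max(1, min(num_segments, budget, n))
    bounds = [k * n // num_segments for k in range(num_segments + 1)]

    chosen = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chosen.add(max(range(lo, hi), key=lambda i: scores[i]))

    # Fine stage: fill the rest of the budget by global motion ranking.
    remaining = sorted(
        (i for i in range(n) if i not in chosen),
        key=lambda i: scores[i],
        reverse=True,
    )
    chosen.update(remaining[: budget - len(chosen)])
    return sorted(chosen)


if __name__ == "__main__":
    # Toy "video": scalar features with two abrupt changes.
    toy = [[0.0], [0.0], [5.0], [5.0], [5.0], [9.0], [9.0], [9.0]]
    print(select_frames(toy, budget=3, num_segments=2))
```

No learned parameters are involved, which is the sense in which such a selector is training-free. A real system would compute motion on image or token features and would likely merge or weight frames rather than simply drop them.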
