TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

Zhaoyu Liu, Xi Weng, Lianyu Hu, Zhe Hou, Kan Jiang, Jin Song Dong, Yang Liu

Abstract

Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics. Our dataset and code are publicly available at https://github.com/LZYAndy/TennisExpert.

Paper Structure

This paper contains 46 sections, 3 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Existing sports commentary datasets primarily provide descriptive narration of immediate actions and visual states (top and middle rows). In contrast, TennisVL provides expert-level analytical commentary (bottom row), emphasizing tactical intent, match momentum, and performance evaluation beyond surface-level descriptions.
  • Figure 2: Dataset statistics of clip duration, commentary length, and tournaments.
  • Figure 3: Overall architecture of TennisExpert. Given a sequence of rallies, the current rally $V_t$ is processed by a video semantic parser to obtain structured metadata $M_t$, including scoreboard state ($s_t$), fine-grained event sequence ($e_t$), and spatial object detections ($o_t$). A hierarchical memory mechanism maintains match context: short-term memory ($S$) stores recent rally representations in a FIFO buffer, while long-term memory ($L$) consolidates past events into cumulative match statistics. The tactic-aware MLLM (Qwen3-VL) integrates visual tokens, structured metadata, and memory context to generate expert-level commentary $C_t$, which is fed back into short-term memory to preserve match momentum for subsequent rallies.
  • Figure 4: Semantic parser visualization on various court surfaces. Court corners (red), ball (green), bounces (yellow), and player boxes are projected onto broadcast frames.
  • Figure 5: Qualitative comparison of commentary generation. Our method produces expert-level commentary across multiple dimensions: (a) tactical depth, (b) temporal momentum, (c) professional terminology, and (d) strategic prediction.
  • ...and 1 more figure
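
The hierarchical memory described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the class names (`RallyRecord`, `HierarchicalMemory`), the `winner` field, the buffer size, and the text format of the context string are all assumptions made for illustration. It only shows the two ideas the caption states, a FIFO buffer of recent rallies for short-term memory ($S$) and cumulative match statistics for long-term memory ($L$), with the generated commentary $C_t$ stored back into the short-term buffer.

```python
from collections import Counter, deque
from dataclasses import dataclass


@dataclass
class RallyRecord:
    """Hypothetical per-rally record: parsed metadata plus generated commentary."""
    score: str            # scoreboard state s_t
    events: list[str]     # fine-grained event sequence e_t
    winner: str           # assumed field, used only for the statistics below
    commentary: str = ""  # generated commentary C_t


class HierarchicalMemory:
    """Illustrative short-term (FIFO) and long-term (cumulative) memory."""

    def __init__(self, short_term_size: int = 3):
        # deque with maxlen drops the oldest rally automatically (FIFO).
        self.short_term = deque(maxlen=short_term_size)
        # Counter accumulates match-level statistics over all rallies.
        self.long_term = Counter()

    def update(self, rally: RallyRecord) -> None:
        """Store the latest rally and fold it into the cumulative statistics."""
        self.short_term.append(rally)
        self.long_term[f"points_won:{rally.winner}"] += 1
        for event in rally.events:
            self.long_term[f"event:{event}"] += 1

    def context(self) -> str:
        """Render both memories as text that could be added to a model prompt."""
        recent = " | ".join(f"{r.score}: {r.commentary}" for r in self.short_term)
        stats = ", ".join(f"{k}={v}" for k, v in sorted(self.long_term.items()))
        return f"Recent rallies: {recent}\nMatch statistics: {stats}"


memory = HierarchicalMemory(short_term_size=2)
memory.update(RallyRecord("15-0", ["serve", "ace"], "A", "A opens with an ace."))
memory.update(RallyRecord("15-15", ["serve", "forehand"], "B", "B answers back."))
memory.update(RallyRecord("30-15", ["serve", "backhand"], "A", "A regains the lead."))
print(memory.context())
```

With a buffer size of 2, the first rally is evicted from short-term memory after the third update, while its ace and point still count toward the long-term statistics. A real system would choose the buffer size and the tracked statistics to match the model's context budget.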