DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen

TL;DR

DeepSport tackles the lack of unified, end-to-end multimodal models for multi-task, multi-sport video reasoning by introducing an active, tool-augmented framework. It combines data distillation to produce Chain-of-Thought trajectories with a two-stage training regime—Supervised Fine-Tuning and agentic Reinforcement Learning using Group Relative Policy Optimization and a gated tool-use reward. The approach demonstrates state-of-the-art performance on a 6.7k-question, diverse sports benchmark and achieves efficient frame usage by iteratively querying a frame-extraction tool. This work establishes a foundation for domain-specific, video-based reasoning in sports, with potential for broader generalization to long-form video tasks. Limitations include data sparsity across sports and the need for improved temporal localization in tool grounding.

Abstract

Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in general domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or reliant on training-free paradigms that lack a robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to "think with videos" by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data sources, creating a unified resource of 78k training data. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model's reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming both proprietary and open-source baseline models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.

Paper Structure

This paper contains 22 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the DeepSport framework. Trained through SFT and RL, DeepSport decides whether to utilize tools, supporting (a) passive single-pass inference, (b) single tool use for two-turn interactions, and (c) iterative multi-turn reasoning via multiple tool calls. As shown on the left, we categorize tasks into four core dimensions: Fine-Grained Recognition, Rule & Procedural Logic, Assessment & Coaching, and Live Commentary, covering diverse fine-grained sub-tasks (in red) across multiple sports.
  • Figure 2: DeepSport training overview. Given sports videos, we first perform data distillation with a teacher MLLM to construct DeepSport-CoT data and use it for supervised fine-tuning of a tool-augmented student model. We then further optimize the model with GRPO-based agentic reinforcement learning, where the agent iteratively calls a frame-extraction tool, produces chain-of-thought reasoning over new frames, and is guided by a reward manager that combines semantic accuracy, behavioral shaping, and format gating.
  • Figure 3: We present a comparison on the diving code detection task, where the model needs to capture each fine-grained movement. The Qwen2.5-VL-7B-Instruct model, relying on passive, single-pass processing of 16 sparsely sampled frames, misses the high-speed contact and reaches the wrong conclusion. Our DeepSport model, despite having only 7B parameters, initiates a multi-turn conversation by invoking frame_extraction_tool(35, 59) to retrieve a second round of relevant frames. Using this new evidence, it correctly identifies the diving code. This demonstrates the superiority of our active, iterative reasoning paradigm over static models.
  • Figure 4: Error analysis on 70 sampled failure cases. The dominance of Tool Grounding Failure (42.9%) highlights that precise temporal localization remains the primary bottleneck, followed by fine-grained Visual Hallucination (37.1%).
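
The Figure 2 caption describes a reward manager that combines semantic accuracy, behavioral shaping, and format gating, optimized with GRPO. The captions do not give the reward formula, so the following Python is only a minimal sketch under assumed rules: a malformed response earns zero reward (format gate), and a tool-use bonus is paid only when the answer is accurate enough (tool-use gate). The names gated_reward, tool_bonus, and acc_threshold, and their default values, are hypothetical and not taken from the paper. The group-relative advantage follows the standard GRPO recipe of normalizing each rollout's reward by the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def gated_reward(format_ok, accuracy, used_tool,
                 tool_bonus=0.2, acc_threshold=0.5):
    """Hypothetical gated reward; the paper's exact formula is not shown here.

    Assumed gates:
      * format gate: a malformed response earns nothing;
      * tool-use gate: the tool bonus is paid only when the answer is
        accurate enough, so calling the tool alone is never rewarded.
    """
    if not format_ok:
        return 0.0
    reward = accuracy
    if used_tool and accuracy >= acc_threshold:
        reward += tool_bonus
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four illustrative rollouts for the same question (made-up values).
rollouts = [
    {"format_ok": True, "accuracy": 1.0, "used_tool": True},    # correct, with tool
    {"format_ok": True, "accuracy": 1.0, "used_tool": False},   # correct, no tool
    {"format_ok": True, "accuracy": 0.0, "used_tool": True},    # wrong, with tool
    {"format_ok": False, "accuracy": 1.0, "used_tool": True},   # malformed output
]

rewards = [gated_reward(**r) for r in rollouts]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Under these assumptions, a rollout that calls the tool but answers incorrectly receives the same reward as a malformed one, so the policy is only encouraged to use the tool when doing so supports a correct answer.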