SportR: A Benchmark for Multimodal Large Language Model Reasoning in Sports

Haotian Xia, Haonan Ge, Junbo Zou, Hyun Woo Choi, Xuebin Zhang, Danny Suradja, Botao Rui, Ethan Tran, Wendy Jin, Zhen Ye, Xiyang Lin, Christopher Lai, Shengjie Zhang, Junwen Miao, Shichao Chen, Rhys Tracy, Vicente Ordonez, Weining Shen, Hanjie Chen

TL;DR

SportR addresses a gap in evaluating fine-grained, rule-based multimodal reasoning across multiple sports by introducing image and video QA benchmarks backed by human-written CoT explanations and explicit visual grounding. The benchmark comprises 5,017 images and 2,101 videos across five sports, with 7,118 CoT rationales and over 20,000 QA pairs covering 50 foul types and 12 tactics. It defines a progressive QA hierarchy and a novel grounding task that ties reasoning to precise visual evidence, and it is presented as the first cross-sport dataset for grounding-based evaluation. Experiments show that state-of-the-art MLLMs struggle on SportR's tasks. Supervised fine-tuning and GRPO-based RL improve performance substantially, especially on image-based reasoning, while grounding remains the main challenge. SportR thus offers a resource for building robust, explainable, cross-sport multimodal reasoning in MLLMs.

Abstract

Deeply understanding sports requires an intricate blend of fine-grained visual perception and rule-based reasoning, a challenge that pushes the limits of current multimodal models. To succeed, models must master three critical capabilities: perceiving nuanced visual details, applying abstract sport rule knowledge, and grounding that knowledge in specific visual evidence. Current sports benchmarks either cover single sports or lack the detailed reasoning chains and precise visual grounding needed to robustly evaluate these core capabilities in a multi-sport context. To address this gap, we introduce SportR, the first large-scale multi-sport benchmark designed to train and evaluate MLLMs on the fundamental reasoning required for sports intelligence. Our benchmark provides a dataset of 5,017 images and 2,101 videos. To enable granular evaluation, we structure our benchmark around a progressive hierarchy of question-answer (QA) pairs designed to probe reasoning at increasing depths, from simple infraction identification to complex penalty prediction. For the most advanced tasks requiring multi-step reasoning, such as determining penalties or explaining tactics, we provide 7,118 high-quality, human-authored Chain of Thought (CoT) annotations. In addition, our benchmark incorporates both image and video modalities and provides manual bounding box annotations to test visual grounding directly on the image portion. Extensive experiments demonstrate the profound difficulty of our benchmark. State-of-the-art baseline models perform poorly on our most challenging tasks. While training on our data via Supervised Fine-Tuning and Reinforcement Learning improves these scores, they remain relatively low, highlighting a significant gap in current model capabilities. SportR presents a new challenge for the community, providing a critical resource to drive future research in multimodal sports reasoning.
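
The abstract says that manual bounding box annotations are used to test visual grounding, but this excerpt does not state how a predicted box is scored against an annotated one. As a minimal sketch, assuming intersection-over-union (IoU), a common choice for box-level grounding, and assuming boxes are given as (x_min, y_min, x_max, y_max), the comparison could look like the following. The function name `box_iou` and the coordinate format are illustrative assumptions, not taken from the paper.

```python
from typing import Tuple

# Assumed box format (not specified in the source): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes.

    Hypothetical helper for illustration; the paper's actual grounding
    metric is not given in this excerpt.
    """
    # Corners of the overlapping rectangle, if one exists.
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])

    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gold = max(0.0, gold[2] - gold[0]) * max(0.0, gold[3] - gold[1])
    union = area_pred + area_gold - inter

    # Two degenerate (zero-area) boxes have no defined overlap; return 0.
    return inter / union if union > 0 else 0.0
```

Under this sketch, a prediction would typically be counted as correctly grounded when `box_iou` exceeds some threshold; the threshold value is likewise an assumption left to the evaluator.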

Paper Structure

This paper contains 33 sections, 3 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Overview of the SportR benchmark. It consists of two parts, SportsImage and SportsVideo, covering 50 rule infraction categories and 12 different kinds of tactics.
  • Figure 2: SportR overview. Left: A three-level pyramid frames our evaluation scope—perception (base and well established), fundamental sports reasoning (our focus), and elite scenarios (out of scope). Right: We instantiate a 13-question hierarchy with concrete examples: Q1–Q7 (SportsImage) cover infraction detection, type, penalty reasoning, explanation, grounding by box coordinates, and offensive/defensive tactics; Q8–Q13 (SportsVideo) mirror these tasks in the temporal domain.
  • Figure 3: System Prompt format for Answer Generation
  • Figure 4: System Prompt format for Answer Generation
  • Figure 5: System Prompt format for Q1 - Q7 SFT training set
  • ...and 15 more figures