Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Yuchen Yang; Yuqing Shao; Duxiu Huang; Linfeng Dong; Yifei Liu; Suixin Tang; Xiang Zhou; Yuanyuan Gao; Wei Wang; Yue Zhou; Xue Yang; Yanfeng Wang; Xiao Sun; Zhihang Zhong

TL;DR

This paper presents CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios, and shows that it provides a scalable pathway toward advancing the spatial intelligence of VLMs in sports.

Abstract

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.
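
The idea of using court geometry as a metric anchor can be illustrated with a simplified planar sketch. It is not the paper's data engine, which estimates camera parameters and reconstructs the scene in 3D; it only shows how four known court corners let pixel positions on the court plane be converted to meters. The pixel coordinates, function names, and the choice of a badminton court (standard outer dimensions 13.40 m by 6.10 m) are assumptions made for illustration.

```python
import numpy as np

# Standard badminton doubles court outer dimensions, in meters.
COURT_WIDTH_M = 6.10
COURT_LENGTH_M = 13.40

def estimate_homography(image_pts, court_pts):
    """Direct linear transform: find H such that court ~ H @ image (homogeneous)."""
    rows = []
    for (u, v), (x, y) in zip(image_pts, court_pts):
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

def image_to_court(h, point):
    """Map a pixel lying on the court plane to metric court coordinates."""
    u, v = point
    x, y, w = h @ np.array([u, v, 1.0])
    return np.array([x / w, y / w])

# Hypothetical pixel locations of the four outer court corners in one frame.
image_pts = [(400, 900), (1520, 900), (1250, 350), (670, 350)]
# The same corners in court coordinates (meters), origin at one near corner.
court_pts = [(0.0, 0.0), (COURT_WIDTH_M, 0.0),
             (COURT_WIDTH_M, COURT_LENGTH_M), (0.0, COURT_LENGTH_M)]

H = estimate_homography(image_pts, court_pts)

# Hypothetical foot positions of two players, in pixels.
player_a = image_to_court(H, (700, 800))
player_b = image_to_court(H, (1100, 420))
distance_m = float(np.linalg.norm(player_a - player_b))
print(f"Estimated distance on the court plane: {distance_m:.2f} m")
```

This mapping is valid only for points on the court plane, such as players' feet. Objects above the ground, such as a ball in flight, require full camera parameters and 3D reasoning.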

Paper Structure (51 sections, 5 equations, 11 figures, 7 tables)

Introduction
Related Work
Spatial Intelligence of VLMs
Sport Understanding
CourtSI Dataset
Data Engine
Court Annotation
Ball Annotation
Player Mesh Recovery
Dataset Curation
Data Preparation
Question-Answer Generation
Quality Control
Experiment
Evaluation Setup
...and 36 more sections

Figures (11)

Figure 1: Overview. We introduce a semi-automatic data engine that reconstructs sports scenes in 3D with court, player, and ball locations. Built upon this pipeline, we present CourtSI and CourtSI-Bench, the first large-scale spatial intelligence dataset and benchmark for sports scenarios. In addition, we provide extra evaluation protocols to validate applicability on an unseen sport and spatial-aware commentary.
Figure 2: Overview of the data engine. It consists of court annotation for metric-aware camera parameter estimation, ball annotation, and player mesh recovery. By leveraging court geometry and incorporating human-in-the-loop supervision, the system enables accurate and world-grounded reconstruction in sports scenarios.
Figure 3: Taxonomy and examples of CourtSI. The questions are categorized into spatial counting, distance measurement, localization, and relational reasoning. Cnt. denotes counting. Obj. refers to object, including the ball and players. Cam. denotes camera. Ego. and Allo. denote ego-centric and allo-centric views.
Figure 4: Distribution of CourtSI and CourtSI-Bench. Obj. refers to object, including the ball and players. Cam. denotes camera.
Figure 5: Error Analysis. The VLMs are prompted to provide detailed step-by-step reasoning. Correct and incorrect reasoning steps are highlighted in green and red, respectively. The questions and the VLMs' explanations are simplified for demonstration.
...and 6 more figures
