
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

Haopeng Li, Andong Deng, Jun Liu, Hossein Rahmani, Yulan Guo, Bernt Schiele, Mohammed Bennamoun, Qiuhong Ke

TL;DR

This paper introduces Sports-QA, the first large-scale sports VideoQA benchmark, with descriptive, temporal, causal, and counterfactual questions across multiple sports, generated from professional action annotations. It also proposes the Auto-Focus Transformer (AFT), which uses Auto-Focus Attention to adaptively focus on particular temporal scales and so handle multi-scale dependencies. Experiments show that AFT achieves state-of-the-art results on Sports-QA; the paper also analyzes cross-sport generalization, focus-length effects, and qualitative predictions. The work advances sports analytics by enabling fine-grained reasoning over professional athletic actions and provides a resource for future research on video question answering in sports contexts.

Abstract

Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task has not been explored due to the lack of relevant datasets and the challenging nature it presents. Most datasets for video question answering (VideoQA) focus mainly on general and coarse-grained understanding of daily-life videos, which is not applicable to sports scenarios requiring professional action understanding and fine-grained motion analysis. In this paper, we introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task. The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions, covering multiple sports. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Illustrations of general VideoQA, which focuses on common basic understanding, and sports VideoQA, which requires professional action understanding and action relation reasoning.
  • Figure 2: The action hierarchy of the MultiSports dataset is shown at the top and that of the FineGym dataset at the bottom. Only four example actions are shown for each sport.
  • Figure 3: Example of Sports-QA: The actions in the green boxes (such as "2-point shot") are the query actions, while the actions in the yellow boxes (such as "block") represent the effects. For ball games, annotators provide attribute labels, and we generate QA pairs based on these attributes. In gymnastics, we generate QA pairs using annotations from MultiSports/FineGym.
  • Figure 4: The distributions of answer classes broken down by question types.
  • Figure 5: The structure of the sports VideoQA model based on the proposed Auto-Focus Transformer.
  • ...and 2 more figures