BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization

Rahul Kumar, Vipul Baghel, Sudhanshu Singh, Bikash Kumar Badatya, Shivam Yadav, Babji Srinivasan, Ravi Hegde

TL;DR

BoxingVI addresses the scarcity of realistic, annotated boxing data for vision-based action understanding. It introduces 6,915 temporally segmented punch clips across six punch types, drawn from 20 unedited YouTube sparring sessions, with 2D pose trajectories and per-clip labels to support temporal localization and pose-conditioned recognition in monocular RGB video. The 20 sessions, which involve 18 athletes, are split by subject into 15 for training (S1 to S15) and 5 for validation (S16 to S20), and every clip carries frame-level punch boundaries. This supports evaluation under unconstrained conditions and applications in automated coaching, performance assessment, and digital-twin development. By aligning temporal, spatial, and semantic annotations in real-world boxing footage, BoxingVI offers a foundation for future extensions to multi-person interactions and cross-discipline combat analytics.

Abstract

Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

Paper Structure

This paper contains 6 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: The dataset comprises $20$ different subjects from YouTube videos. A: Subjects S1 to S15 are used for training, while S16 to S20 are used for validation. Each column corresponds to a punch class (Cross, Jab, Lead Hook, Lead Uppercut, Rear Hook, Rear Uppercut) shown across subjects. B: A sequence from frame 1 to 6 depicts a cross punch from initiation to completion. Thumbnails are reproduced under the Fair Use Policy of YouTube.
  • Figure 2: Tracking the person of interest across the video using the least Euclidean distance method applied to the detected poses. By determining the center of mass from the keypoints of the shoulders and hips, the individual is identified and consistently followed throughout the videos. Thumbnails are reproduced under the Fair Use Policy of YouTube.
  • Figure 3: The least Euclidean distance method is used for tracking across $m$ frames. $P_1, P_2, \dots, P_n$ denote the $n$ detected persons, where $P_i$ is the person of interest. The Euclidean distance $r_i$ is calculated between the position of the person of interest in the first frame and the $i$-th person in the second frame. The person with the least $r_i$ is assumed to be the same as the person of interest.
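
The tracking step described in Figures 2 and 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each detected pose is a dictionary from keypoint name to an (x, y) pixel coordinate. The keypoint names, the function names, and the choice to update the reference position after every frame are all hypothetical. The captions only state that the center of mass comes from the shoulder and hip keypoints and that the candidate with the least Euclidean distance is taken to be the person of interest.

```python
import math

# Hypothetical keypoint names; the source only says shoulders and hips are used.
TORSO_KEYS = ("left_shoulder", "right_shoulder", "left_hip", "right_hip")


def center_of_mass(pose):
    """Return the mean (x, y) of the shoulder and hip keypoints of one pose."""
    xs = [pose[k][0] for k in TORSO_KEYS]
    ys = [pose[k][1] for k in TORSO_KEYS]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def track_person(frames, start_index):
    """Follow one person through a sequence of frames.

    frames: list of frames, each a list of poses (dict: keypoint -> (x, y)).
    start_index: index of the person of interest among the first frame's poses.
    Returns the selected pose index for every frame, or None for a frame
    with no detections.
    """
    ref = center_of_mass(frames[0][start_index])
    chosen = [start_index]
    for poses in frames[1:]:
        if not poses:
            # No detections: keep the last known position as the reference.
            chosen.append(None)
            continue
        # r_i: distance from the reference position to each candidate.
        dists = [math.dist(ref, center_of_mass(p)) for p in poses]
        best = min(range(len(poses)), key=dists.__getitem__)
        chosen.append(best)
        # Assumption: the reference moves with the tracked person.
        ref = center_of_mass(poses[best])
    return chosen
```

Updating the reference after each match keeps the comparison local to consecutive frames, which matches the first-frame and second-frame wording of the Figure 3 caption. A pure nearest-neighbour rule like this one can swap identities when two boxers cross paths, so a real pipeline would probably need an extra check at that point.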