YO-CSA-T: A Real-time Badminton Tracking System Utilizing YOLO Based on Contextual and Spatial Attention

Yuan Lai; Zhiwei Shi; Chengxi Zhu

TL;DR

This work tackles the problem of real-time 3D shuttlecock trajectory tracking for badminton robotics. It introduces YO-CSA-T, a YOLOv8s-based detector enhanced with a Contextual Transformer Block (CoT2f) and a Spatial Attention-Integrated Neck (SANeck), plus a decoupled head, to robustly detect the small, fast shuttlecock. The system maps 2D detections to 3D coordinates $(x,y,z)$ via stereo vision, predicts future positions, and uses a compensation module to interpolate missing frames, achieving 90.43% mAP@0.75 and real-time performance (>130 fps) on a dataset of 32,539 images. The approach enables accurate, real-time 3D trajectory extraction, with implications for robotic control, match analysis, and automated coaching in badminton.

Abstract

A badminton rally robot built for human-robot competition needs the 3D trajectory of the shuttlecock in real time and with high accuracy. However, the shuttlecock's fast flight, various visual effects, and its tendency to blend with environmental elements such as court lines and lighting make rapid and accurate 2D detection difficult. In this paper, we first propose the YO-CSA detection network, which optimizes and reconfigures the backbone, neck, and head of the YOLOv8s model by incorporating contextual and spatial attention mechanisms, enhancing the model's ability to extract and integrate both global and local features. Next, we integrate three major subtasks (detection, prediction, and compensation) into a real-time 3D shuttlecock trajectory detection system. Specifically, our system maps the 2D coordinate sequence extracted by YO-CSA into 3D space using stereo vision, predicts future 3D coordinates from historical information, and re-projects them onto the left and right views to update the position constraints for 2D detection. The system also includes a compensation module that fills in missing intermediate frames, yielding a more complete trajectory. We conduct extensive experiments on our own dataset to evaluate both YO-CSA's performance and the system's effectiveness. Experimental results show that YO-CSA achieves 90.43% mAP@0.75, surpassing both YOLOv8s and YOLO11s. The full system maintains a speed of over 130 fps across 12 test sequences.
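To make the 2D-to-3D mapping and the compensation step more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes an ideal rectified pinhole stereo pair (depth from horizontal disparity) and plain linear interpolation for missing frames, neither of which is confirmed by the abstract. The function names `triangulate_rectified` and `fill_missing`, and the parameters `focal_px`, `baseline_m`, `cx`, and `cy`, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]


def triangulate_rectified(u_left: float, v_left: float, u_right: float,
                          focal_px: float, baseline_m: float,
                          cx: float, cy: float) -> Point3D:
    """Map matched left/right pixel detections to camera-frame (x, y, z).

    Assumption: the stereo pair is rectified, so the match lies on the same
    image row and depth follows from disparity as z = f * B / d.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity
    x = (u_left - cx) * z / focal_px
    y = (v_left - cy) * z / focal_px
    return (x, y, z)


def fill_missing(track: List[Optional[Point3D]]) -> List[Optional[Point3D]]:
    """Fill gaps (None entries) between known 3D points by linear interpolation.

    Assumption: linear interpolation stands in for the compensation module;
    leading and trailing gaps are left as None because they have no neighbor
    on one side.
    """
    filled: List[Optional[Point3D]] = list(track)
    known = [i for i, p in enumerate(track) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)  # fraction of the way from frame a to frame b
            filled[i] = tuple(pa + t * (pb - pa)
                              for pa, pb in zip(track[a], track[b]))
    return filled
```

For example, `triangulate_rectified(700, 400, 600, focal_px=1000, baseline_m=0.5, cx=640, cy=360)` gives a disparity of 100 pixels and therefore a depth of 5 m. A real shuttlecock follows a curved, drag-dominated path, so a physically motivated model would likely fit longer gaps better than straight-line interpolation.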

Paper Structure (21 sections, 6 equations, 10 figures, 4 tables)

Introduction
Related Work
Object Detection
Self-Attention Mechanism
Tracking in Small-Size Ball Sports
Real-time Detection Module
Brief Review of YOLO
Overview of Detection Network
Contextual Transformer Block with 2 Convolutions
Spatial Attention-Integrated Neck
Decouple head with SGE
YO-CSA-T System Design
Hardware Infrastructure of Stereo Vision
Detection Module
Comprehensive Tracking Workflow
...and 6 more sections

Figures (10)

Figure 1: Structure of YO-CSA
Figure 2: Structure of CoT2f
Figure 3: Spatial Attention-Integrated Neck
Figure 4: SGE with 2 Convolution
Figure 5: Decouple head with SGE
...and 5 more figures
