Automated Detection of Sport Highlights from Audio and Video Sources
Francesco Della Santa, Morgana Lalli
TL;DR
The paper addresses automated detection of sports highlights by proposing a lightweight dual-stream deep learning (DL) framework that combines audio analysis of Mel-spectrograms with analysis of grayscale video frames. Audio is modeled with a 2D CNN on Mel-spectrograms, while video uses transfer learning from image classifiers adapted to multi-frame grayscale inputs. The two outputs are fused into an ensemble score, H_e(τ) = 0.5 (H_a(τ) + H_v(τ)), where H_a and H_v are the audio and video scores. On small, balanced datasets, the audio model achieves about 0.89 accuracy and the video model about 0.83, and the ensemble reduces false positives and false negatives, making highlight detection more robust across sports. The approach is suited to fast, scalable deployment for automated highlight generation and could extend to broader scene-detection tasks given further architectural enhancements and more data. It shows practical potential for improving content summaries, recommendations, and viewer engagement in sports media.
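The fusion rule in the TL;DR, an equal-weight average of the audio and video scores, can be illustrated with a minimal sketch. The function names `ensemble_scores` and `detect_highlights`, the assumption that both models emit one score in [0, 1] per time window τ, and the 0.5 decision threshold are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_scores(audio_scores: Sequence[float],
                    video_scores: Sequence[float]) -> List[float]:
    """Average per-window audio and video scores with equal weights.

    Assumes (hypothetically) that both sequences are aligned on the same
    time windows and that each score lies in [0, 1].
    """
    if len(audio_scores) != len(video_scores):
        raise ValueError("audio and video scores must cover the same windows")
    return [0.5 * (a + v) for a, v in zip(audio_scores, video_scores)]


def detect_highlights(scores: Sequence[float],
                      threshold: float = 0.5) -> List[int]:
    """Return indices of windows whose score reaches the threshold.

    The default threshold of 0.5 is an illustrative assumption.
    """
    return [i for i, s in enumerate(scores) if s >= threshold]


# Made-up scores for four time windows (not data from the paper).
audio = [0.9, 0.2, 0.8, 0.1]
video = [0.7, 0.4, 0.1, 0.3]

fused = ensemble_scores(audio, video)
print(detect_highlights(fused))
```

With these made-up inputs, window 2 has a high audio score but a low video score, so the averaged score falls below the threshold and only window 0 is flagged. This is the kind of single-modality false positive that averaging the two streams can dampen.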
Abstract
This study presents a novel, lightweight Deep Learning (DL) approach for the automated detection of sports highlights (HLs) from audio and video sources. HL detection is a key task in sports video analysis and has traditionally required significant human effort. Our solution uses DL models trained on relatively small datasets of audio Mel-spectrograms and grayscale video frames, reaching accuracies of 89% for audio detection and 83% for video detection. The combination of small datasets and simple architectures demonstrates the practicality of our method for fast and cost-effective deployment. Furthermore, an ensemble model combining both modalities is more robust against false positives and false negatives than either model alone. The proposed methodology offers a scalable solution for automated HL detection across various types of sports video content, reducing the need for manual intervention. Future work will focus on enhancing the model architectures and extending this approach to broader scene-detection tasks in media analysis.
