Automated Detection of Sport Highlights from Audio and Video Sources

Francesco Della Santa, Morgana Lalli

TL;DR

The paper addresses automated highlight detection in sports content by proposing a lightweight dual-stream deep learning (DL) framework that combines audio analysis of Mel-spectrograms with analysis of grayscale video frames. Audio is modeled with a 2D CNN on Mel-spectrograms, while video uses transfer learning from image classifiers adapted to multi-frame grayscale inputs. The two outputs are fused into an ensemble score that averages the audio and video scores, $H_e(\tau) = 0.5\,(H_a(\tau) + H_v(\tau))$. On small, balanced datasets, the audio model reaches an accuracy of about 0.89 and the video model about 0.83, and the ensemble reduces both false positives and false negatives, making highlight detection more robust across sports. The approach allows fast, scalable deployment of automated highlight generation and could extend to broader scene-detection tasks given further architectural enhancements and more data. It shows practical potential for improving content summaries, recommendations, and viewer engagement in sports media.
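The ensemble rule can be illustrated with a minimal sketch. It assumes that each model emits one score in [0, 1] per time step τ, and that a time step is flagged as a highlight when its score exceeds the threshold ε = 0.5 shown in the figures. The function names, the strict comparison, and the sample scores below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ensemble_scores(audio: Sequence[float], video: Sequence[float]) -> List[float]:
    """Average per-step audio and video scores: H_e = 0.5 * (H_a + H_v)."""
    if len(audio) != len(video):
        raise ValueError("audio and video score sequences must have equal length")
    return [0.5 * (a + v) for a, v in zip(audio, video)]


def detect_highlights(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of time steps whose score exceeds the threshold.

    The strict '>' comparison is an assumption; the source only states
    that the detection threshold is 0.5.
    """
    return [tau for tau, score in enumerate(scores) if score > threshold]


# Made-up scores for four consecutive time steps (illustration only).
audio_scores = [0.9, 0.2, 0.7, 0.1]
video_scores = [0.8, 0.6, 0.2, 0.3]

fused = ensemble_scores(audio_scores, video_scores)

audio_hits = detect_highlights(audio_scores)
video_hits = detect_highlights(video_scores)
fused_hits = detect_highlights(fused)
```

In this toy input, each single-modality model flags a time step that the other does not, while the averaged score only crosses the threshold where both models agree. This is one way the ensemble can suppress isolated false positives, under the assumptions stated above.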

Abstract

This study presents a novel Deep Learning-based and lightweight approach for the automated detection of sports highlights (HLs) from audio and video sources. HL detection is a key task in sports video analysis, traditionally requiring significant human effort. Our solution leverages Deep Learning (DL) models trained on relatively small datasets of audio Mel-spectrograms and grayscale video frames, achieving promising accuracy rates of 89% and 83% for audio and video detection, respectively. The use of small datasets, combined with simple architectures, demonstrates the practicality of our method for fast and cost-effective deployment. Furthermore, an ensemble model combining both modalities shows improved robustness against false positives and false negatives. The proposed methodology offers a scalable solution for automated HL detection across various types of sports video content, reducing the need for manual intervention. Future work will focus on enhancing model architectures and extending this approach to broader scene-detection tasks in media analysis.

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Example of Mel-spectrogram of an audio chunk
  • Figure 2: Example of a black-and-white video frame
  • Figure 3: First clip. Prediction scores of the models; blue $\mathcal{M}_a$ (audio), orange $\mathcal{M}_v$ (video), and green $\mathcal{M}_e$ (ensemble). The red line represents the detection threshold $\epsilon=0.5$. The pictures illustrate the action in the video clip at the seconds pointed out by the arrows.
  • Figure 4: Second clip. Prediction scores of the models; blue $\mathcal{M}_a$ (audio), orange $\mathcal{M}_v$ (video), and green $\mathcal{M}_e$ (ensemble). The red line represents the detection threshold $\epsilon=0.5$. The pictures illustrate the action in the video clip at the seconds pointed out by the arrows.

Theorems & Definitions (3)

  • Remark 2.1: On the choice of $k$ for building the datasets
  • Remark 3.1: NDA and limited description of NN architectures
  • Remark 4.1: NDA and limited description of NN training options