A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage

Levi Harris

TL;DR

This work aims to speed up the development of large, multi-modal video datasets for training data-hungry video models in the sports action recognition domain. It aligns a pre-labeled corpus of densely annotated play-by-play events to video frames, enabling quick retrieval of labeled video segments.

Abstract

We present a reliable temporal grounding pipeline for video-to-analytic alignment of basketball broadcast footage. Given a series of frames as input, our method quickly and accurately extracts time-remaining and quarter values from basketball broadcast scenes. Our work aims to expedite the development of large, multi-modal video datasets for training data-hungry video models in the sports action recognition domain. Our method aligns a pre-labeled corpus of densely annotated play-by-play events to video frames, enabling quick retrieval of labeled video segments. Unlike previous methods, we do not localize the game clock as a separate step; instead, we fine-tune an out-of-the-box object detector to find semantic text regions directly. This end-to-end design improves the generality of our approach. Additionally, interpolation and parallelization techniques prepare our pipeline for deployment on a large computing cluster. All code is made publicly available.
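As a rough illustration of the alignment idea described in the abstract, the sketch below maps per-frame clock readings and play-by-play events onto a shared "seconds elapsed" axis and looks up the first frame that reaches each event time. It is not the released implementation. The function names, the 12-minute quarter length, the use of linear interpolation to fill frames with no reading, and the omission of overtime are all assumptions made for this example.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

QUARTER_SECONDS = 12 * 60  # assumption: 12-minute quarters, overtime ignored


def game_elapsed(quarter: int, seconds_remaining: float) -> float:
    """Map (quarter, time remaining) to seconds elapsed since the game started."""
    return (quarter - 1) * QUARTER_SECONDS + (QUARTER_SECONDS - seconds_remaining)


def interpolate_gaps(values: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly fill None entries that lie between two known readings.

    Leading and trailing None entries are left untouched.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def align_events(
    frame_elapsed: List[Optional[float]],
    events: List[Tuple[str, int, float]],
) -> List[Tuple[str, Optional[int]]]:
    """Return, for each (label, quarter, seconds_remaining) event, the index of
    the first frame whose game clock has reached the event time, or None if no
    such frame exists. Assumes frame times are non-decreasing."""
    known = [(t, i) for i, t in enumerate(frame_elapsed) if t is not None]
    times = [t for t, _ in known]
    aligned = []
    for label, quarter, remaining in events:
        pos = bisect_left(times, game_elapsed(quarter, remaining))
        aligned.append((label, known[pos][1] if pos < len(known) else None))
    return aligned


# Hypothetical clock readings (seconds remaining in quarter 1) for six frames;
# None marks frames where no reading was extracted.
readings = [720.0, None, None, 717.0, None, 715.0]
elapsed = interpolate_gaps(
    [None if r is None else game_elapsed(1, r) for r in readings]
)

# Hypothetical play-by-play entries: (label, quarter, seconds remaining).
events = [("jump ball", 1, 720.0), ("shot", 1, 716.0), ("rebound", 1, 700.0)]
result = align_events(elapsed, events)
```

Here the event at 716 seconds remaining falls on a frame with no direct reading and is placed using the interpolated value, while the event at 700 seconds remaining lies beyond the sampled frames and is left unmatched.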

Paper Structure

This paper contains 10 sections, 4 equations, 2 figures, 1 algorithm.

Figures (2)

  • Figure 1: We motivate our pipeline with the figure above. Given pre-annotated game logs and unlabeled video footage, we devise a method to align both modalities. The final result of our temporal grounding pipeline is a video-to-text aligned corpus intended to enable rapid dataset development.
  • Figure 2: Extract Timestamps from Broadcast