Unifying Global and Local Scene Entities Modelling for Precise Action Spotting

Kim Hoang Tran; Phuc Vuong Do; Ngoc Quoc Ly; Ngan Le

TL;DR

The paper tackles action spotting in sports videos, where small, scene-specific elements are often missed by global-frame backbones. It introduces the Unifying Global and Local (UGL) module to disentangle global environmental context from local relevant scene entities: a RegNet-Y 2D backbone with a time-shift module extracts the global context, and the Vision-Language model GLIP with an Adaptive Attention Mechanism extracts the local entities. A Long-Term Temporal Reasoning module based on a bidirectional GRU aggregates context and yields per-frame action scores, and the network is trained with Focal Loss to address class imbalance. The approach achieves state-of-the-art results on SoccerNet-v2, FineDiving, and FineGym, offers interpretable insights into which scene elements drive actions, and is released with open-source code.
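As a rough illustration of why Focal Loss helps with imbalanced action classes, the sketch below implements the standard focal loss for a single frame in plain Python. The function names and the `gamma=2.0` default are assumptions chosen for illustration; this summary does not state the hyperparameters or the exact formulation used in the paper.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Standard focal loss for one frame (illustrative, not the paper's code).

    probs  : list of predicted class probabilities for the frame
    target : index of the ground-truth class
    gamma  : focusing parameter; gamma=0 reduces to cross-entropy
    alpha  : optional list of per-class weights (assumed, hypothetical)
    """
    # Clamp to avoid log(0) on degenerate predictions.
    p_t = min(max(probs[target], 1e-12), 1.0)
    weight = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma down-weights frames the model already classifies well.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(probs, target):
    """Plain cross-entropy, expressed as focal loss with gamma = 0."""
    return focal_loss(probs, target, gamma=0.0)
```

With `gamma=2.0`, a frame predicted correctly with probability 0.9 contributes 1% of its cross-entropy loss, while a frame predicted with probability 0.1 keeps 81% of it, so rare, hard action frames dominate the gradient over abundant, easy background frames.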

Abstract

Sports videos pose complex challenges, including cluttered backgrounds, camera angle changes, small action-representing objects, and imbalanced action class distributions. Existing methods for detecting actions in sports videos rely heavily on global features, using a backbone network as a black box that processes the entire spatial frame. As a result, they tend to overlook the nuances of the scene and struggle to detect actions that occupy only a small portion of the frame, in particular action classes involving small objects such as balls or yellow/red cards in soccer. To address these challenges, we introduce a novel approach that analyzes and models scene entities using an adaptive attention mechanism. Specifically, our model disentangles the scene content into a global environment feature and a local relevant scene entities feature. To extract environmental features efficiently while accounting for temporal information at lower computational cost, we propose a 2D backbone network with a time-shift mechanism. To accurately capture relevant scene entities, we employ a Vision-Language model in conjunction with the adaptive attention mechanism. Our model achieves 1st place in the SoccerNet-v2 Action Spotting, FineDiving, and FineGym challenges, improving avg-mAP by 1.6, 2.0, and 1.3 points, respectively, over the runner-up methods. Furthermore, our approach offers interpretability, in contrast to other deep learning models, which are often designed as black boxes. Our code and models are released at: https://github.com/Fsoft-AIC/unifying-global-local-feature.
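To make the idea of a time-shift mechanism on a 2D backbone concrete, here is a minimal sketch of a generic channel-wise temporal shift on a T x C feature sequence. It is an assumption-laden illustration, not the authors' module: the function name `temporal_shift`, the `fold_div` parameter, and the zero-padding at the clip boundaries are all hypothetical choices, and the paper's shift may be gated or learned.

```python
def temporal_shift(frames, fold_div=4):
    """Generic channel-wise temporal shift (illustrative sketch only).

    frames   : list of T per-frame feature vectors, each a list of C values
    fold_div : 1/fold_div of the channels is shifted in each time direction
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    # Positions with no temporal neighbour are zero-padded.
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first fold: read from the next frame
            elif c < 2 * fold:
                src = t - 1      # second fold: read from the previous frame
            else:
                src = t          # remaining channels stay in place
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted
```

The shift itself adds no parameters and no multiplications: each frame's features simply borrow a few channels from its neighbours, so the 2D convolutions that follow can mix information across time without the cost of a 3D backbone.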

Paper Structure (17 sections, 7 equations, 5 figures, 4 tables)

Introduction
Related works
Our proposed method
Unifying Global and Local (UGL) Module
Global Environment Feature
Local Relevant Entities Feature
Fusion Component
Long-term Temporal Reasoning (LTR) Module
Training Methodology
Experiments
Sports Video Datasets
Evaluation metrics
Implementation Details
Comparison with the SOTA Methods
Ablation Studies
...and 2 more sections

Figures (5)

Figure 1: Comparison of our proposed framework with existing SOTA methods, including the Yahoo method [yahoo-soares-temporal] (a) and the Stanford method [e2e_spotting] (b). The existing methods (a, b) extract features from entire video frames and may therefore overlook smaller key scene entities (e.g., red/yellow cards) that play a crucial role in actions. In contrast, our approach (c) disentangles the scene into global environmental features and local scene entities, using the Vision-Language (VL) model and an adaptive attention mechanism (AAM) to better focus on the pertinent entities that are actively involved in the action.
Figure 2: The architecture of our proposed UGL module. Given a frame $v_t$, our UGL module concurrently extracts both the global environment feature $f^{Env}$ and the local relevant entities feature $f^{Ent}$. Subsequently, it combines these features to produce a unified global-local entities-environment feature $f^{Ent-Env}$, acting as the unifying link between the local and global representations.
Figure 3: Pipeline to obtain the local scene entities features $f^{Inter-Ent}$ from GLIP, which correspond to the vocabulary of sports scene entities.
Figure 4: The overall architecture of our proposed network, consisting of Unifying Global-Local (UGL) module and Long-term Temporal Reasoning (LTR) module.
Figure 5: Visualization of four challenges in SoccerNet-v2 data.
