Few-Shot Precise Event Spotting via Unified Multi-Entity Graph and Distillation

Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong

TL;DR

This work tackles few-shot precise event spotting (PES) in sports videos by introducing UMEG-Net, a Unified Multi-Entity Graph Network that integrates human poses, ball trajectories, and court context into a single graph. The framework combines a graph-based spatio-temporal encoder with a parameter-free temporal shift, and uses multimodal distillation to train an RGB student, enabling frame-accurate event localization with limited labels. Across five fine-grained sports datasets, UMEG-Net achieves state-of-the-art few-shot performance, and the distilled variant further improves robustness. Ablations show the contribution of object and environment keypoints, multiple temporal scales, and distillation. The approach offers a scalable solution for PES in sports analytics and video understanding, and remains competitive under full supervision.

Abstract

Precise event spotting (PES) aims to recognize fine-grained events at exact moments and has become a key component of sports analytics. This task is particularly challenging due to the rapid succession of events, motion blur, and subtle visual differences. Consequently, most existing methods rely on domain-specific, end-to-end training with large labeled datasets and often struggle in few-shot conditions because they depend on pixel- or pose-based inputs alone. Obtaining large labeled datasets, however, is difficult in practice. We propose a Unified Multi-Entity Graph Network (UMEG-Net) for few-shot PES. UMEG-Net integrates human skeletons and sport-specific object keypoints into a unified graph and features an efficient spatio-temporal extraction module based on advanced GCN and multi-scale temporal shift. To further enhance performance, we employ multimodal distillation to transfer knowledge from keypoint-based graphs to visual representations. Our approach achieves robust performance with limited labeled data and significantly outperforms baseline models in few-shot settings, providing a scalable and effective solution for few-shot PES. Code is publicly available at https://github.com/LZYAndy/UMEG-Net.
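
To make the "multi-scale temporal shift" idea concrete, here is a minimal sketch of a parameter-free temporal shift in plain Python. It is not taken from the UMEG-Net code: the function names `temporal_shift` and `multi_scale_shift`, the `fold_div` ratio, the zero padding, and the choice of shift sizes are all assumptions for illustration. Each frame is reduced to a single feature vector for brevity, whereas a graph model would hold one such vector per node.

```python
from typing import List, Sequence

Frame = List[float]  # one feature vector per frame (illustrative simplification)


def temporal_shift(x: List[Frame], shift: int = 1, fold_div: int = 4) -> List[Frame]:
    """Shift a fraction of the channels along time, with zero padding.

    The first C // fold_div channels read from frame t - shift (look back),
    the next C // fold_div channels read from frame t + shift (look ahead),
    and the remaining channels are left in place. No weights are learned.
    """
    num_frames, num_channels = len(x), len(x[0])
    fold = num_channels // fold_div
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - shift      # look back
            elif c < 2 * fold:
                src = t + shift      # look ahead
            else:
                src = t              # unshifted channel
            if 0 <= src < num_frames:
                out[t][c] = x[src][c]
    return out


def multi_scale_shift(x: List[Frame], shifts: Sequence[int] = (1, 2),
                      fold_div: int = 4) -> List[Frame]:
    """Assumed multi-scale variant: split the channels into len(shifts)
    contiguous groups and apply a different shift size to each group."""
    num_channels, n = len(x[0]), len(shifts)
    bounds = [i * num_channels // n for i in range(n + 1)]
    out: List[Frame] = [[] for _ in x]
    for s, lo, hi in zip(shifts, bounds, bounds[1:]):
        group = [frame[lo:hi] for frame in x]
        for row, part in zip(out, temporal_shift(group, s, fold_div)):
            row.extend(part)
    return out
```

Because the shift only moves existing values between neighboring frames, it mixes temporal context without adding parameters, which is one plausible reason such an operation suits the few-shot setting described above.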

Paper Structure

This paper contains 43 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Precise event spotting in sports videos, with event timestamps highlighted in red. Each scene can be represented by a unified graph including human poses and sport-related entity keypoints (e.g., ball, court corners).
  • Figure 2: The framework of our proposed method, including UMEG-Net and multimodal distillation. Each frame is converted to a unified multi-entity graph and processed by stacked UMEG Blocks to produce features for precise event spotting. A transformer-based RGB student is trained via knowledge distillation from the frozen graph-based teacher.
  • Figure 3: F1$_\text{evt}$ and Edit scores under few-shot ($k$-clip) training. Percentages indicate the fraction of the full training set.
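
The unified graph described in the Figure 1 caption (human poses plus ball and court-corner keypoints) can be illustrated with a small adjacency-matrix sketch. Everything specific here is hypothetical: the five-joint toy skeleton, the node names, and the edge choices (ball linked to wrists and to court corners, corners linked in a ring) are assumptions for illustration and are not the connectivity used by UMEG-Net.

```python
from typing import List, Tuple

# Hypothetical toy skeleton: real pose estimators output many more joints.
PLAYER_JOINTS = ["head", "l_wrist", "r_wrist", "l_ankle", "r_ankle"]
PLAYER_BONES = [("head", "l_wrist"), ("head", "r_wrist"),
                ("head", "l_ankle"), ("head", "r_ankle")]
NUM_CORNERS = 4


def build_unified_graph(num_players: int = 2) -> Tuple[List[str], List[List[int]]]:
    """Return node names and a symmetric 0/1 adjacency matrix with self-loops
    for one frame containing players, one ball, and four court corners."""
    names: List[str] = []
    for p in range(num_players):
        names += [f"p{p}_{joint}" for joint in PLAYER_JOINTS]
    names.append("ball")
    names += [f"court_{i}" for i in range(NUM_CORNERS)]
    index = {name: i for i, name in enumerate(names)}

    edges: List[Tuple[str, str]] = []
    for p in range(num_players):
        # Intra-player bones.
        edges += [(f"p{p}_{a}", f"p{p}_{b}") for a, b in PLAYER_BONES]
        # Assumed cross-entity edges: ball to each player's wrists.
        edges += [("ball", f"p{p}_l_wrist"), ("ball", f"p{p}_r_wrist")]
    for i in range(NUM_CORNERS):
        # Assumed environment edges: corners in a ring, each also tied to the ball.
        edges.append((f"court_{i}", f"court_{(i + 1) % NUM_CORNERS}"))
        edges.append(("ball", f"court_{i}"))

    n = len(names)
    adj = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        adj[i][j] = adj[j][i] = 1
    return names, adj
```

In this sketch the ball node acts as a hub linking players to the court, so information can pass between all entity types within a few graph-convolution steps. A real system would fill the node features with detected keypoint coordinates for each frame.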