F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Zhaoyu Liu, Kan Jiang, Murong Ma, Zhe Hou, Yun Lin, Jin Song Dong

TL;DR

This work defines fast, frequent, and fine-grained (F$^3$) events in video and introduces F$^3$Set, a large-scale benchmark with over 1,000 precisely timestamped event types and multi-level granularity, using tennis as a case study. It presents an end-to-end baseline, F$^3$ED, comprising a Video Encoder, an Event Localizer, a Multi-label Classifier, and a Contextual module to refine event sequences, along with an expansive annotation pipeline and toolchain. Through extensive experiments and ablations, the authors show that frame-wise dense features and long-term temporal reasoning are crucial, with multi-label targets and contextual refinement substantially improving performance over baselines. The results indicate clear challenges for existing methods at finer granularity and demonstrate the potential for generalization to semi-F$^3$ domains, highlighting practical implications for sports analytics and broader real-world video understanding tasks.

Abstract

Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and the framework may be extended to other applications as well. We evaluated popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detection, which achieves superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.

Paper Structure

This paper contains 57 sections, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Example of detecting fast, frequent, and fine-grained events with precise moments.
  • Figure 2: Breakdown of F$^3$Set event class annotation.
  • Figure 3: An interface of the labeling tool. The panel on the right is application-customizable.
  • Figure 4: Overview of F$^3$ED. RGB images are processed by VE to capture frame-wise spatial-temporal features, which are passed to LCL to identify event timestamps and MLC to predict labels. Outputs from LCL and MLC are combined ('plus' symbol) to form an event representation sequence and refined by CTX module. 'Red squares' represent errors from purely visual predictions.
  • Figure 5: Video frames from a tennis rally.
  • ...and 1 more figures
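
The Figure 4 caption describes per-frame localizer (LCL) and multi-label classifier (MLC) outputs being combined into an event sequence. The following is a minimal sketch of that combination step only, not the authors' implementation: the function name `decode_events`, the two thresholds, and the local-maximum peak picking are all assumptions made for illustration, and the contextual refinement (CTX) stage is omitted.

```python
from typing import List, Tuple

Event = Tuple[int, Tuple[int, ...]]  # (frame index, indices of active labels)


def decode_events(
    loc_scores: List[float],
    label_scores: List[List[float]],
    loc_threshold: float = 0.5,    # assumed value, not taken from the paper
    label_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Event]:
    """Combine per-frame localizer and multi-label scores into an event list.

    loc_scores[t] is a hypothetical per-frame probability that an event
    occurs at frame t; label_scores[t][k] is a hypothetical probability
    that label k is active at frame t.
    """
    events: List[Event] = []
    n = len(loc_scores)
    for t in range(n):
        score = loc_scores[t]
        if score < loc_threshold:
            continue
        # Keep only local maxima so one event does not fire on adjacent frames.
        left = loc_scores[t - 1] if t > 0 else float("-inf")
        right = loc_scores[t + 1] if t + 1 < n else float("-inf")
        if score < left or score <= right:
            continue
        # Multi-label decision: every label above the threshold is active.
        labels = tuple(
            k for k, p in enumerate(label_scores[t]) if p >= label_threshold
        )
        events.append((t, labels))
    return events


if __name__ == "__main__":
    loc = [0.1, 0.6, 0.9, 0.4, 0.1, 0.8, 0.2]
    labels = [
        [0.0, 0.0, 0.0],
        [0.8, 0.1, 0.6],
        [0.9, 0.2, 0.7],
        [0.5, 0.1, 0.4],
        [0.0, 0.0, 0.0],
        [0.1, 0.8, 0.3],
        [0.0, 0.1, 0.0],
    ]
    for frame, active in decode_events(loc, labels):
        print(frame, active)
```

In this sketch each detected event carries a set of labels rather than a single class, which mirrors the multi-label targets mentioned in the TL;DR. A sequence model standing in for CTX would take the returned list as input and correct label combinations that are implausible in context.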