F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Zhaoyu Liu; Kan Jiang; Murong Ma; Zhe Hou; Yun Lin; Jin Song Dong

TL;DR

This work defines fast, frequent, and fine-grained (F$^3$) events in video and introduces F$^3$Set, a large-scale benchmark with over 1,000 precisely timestamped event types and multi-level granularity, using tennis as a case study. It presents an end-to-end baseline, F$^3$ED, comprising a video encoder, an event localizer, a multi-label classifier, and a contextual module that refines event sequences, along with an annotation pipeline and toolchain. Through experiments and ablations, the authors show that frame-wise dense features and long-term temporal reasoning are crucial, and that multi-label targets and contextual refinement substantially improve performance over baselines. The results show that existing methods struggle at finer granularity and suggest that the approach can generalize to semi-F$^3$ domains, with practical implications for sports analytics and broader real-world video understanding tasks.

Abstract

Analyzing Fast, Frequent, and Fine-grained (F$^3$) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all the F$^3$ criteria with high accuracy due to challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F$^3$Set, a benchmark that consists of video datasets for precise F$^3$ event detection. Datasets in F$^3$Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. Currently, F$^3$Set contains several sports datasets, and the framework may be extended to other applications as well. We evaluate popular temporal action understanding methods on F$^3$Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F$^3$ED, for F$^3$ event detection, which achieves superior performance. The dataset, model, and benchmark code are available at https://github.com/F3Set/F3Set.
