Online Generic Event Boundary Detection

Hyungrok Jung, Daneul Kim, Seunggyun Lim, Jeany Son, Jonghyun Choi

TL;DR

This work introduces Online Generic Event Boundary Detection (On-GEBD), a streaming task that requires predicting event boundaries in real time using only past and present frames. It proposes ESTimator, a framework inspired by Event Segmentation Theory with two components: the Consistent Event Anticipator (CEA), which anticipates ongoing event dynamics, and the Online Boundary Discriminator (OBD), which detects boundaries using adaptive error thresholds. CEA uses a transformer decoder to predict the future frame feature $\hat{\mathbf{f}}_t$ from past frames and is trained with an EST loss and a REST loss to reinforce consistent predictions. OBD maintains a memory of past prediction errors and applies Gaussian-based statistical testing with threshold $\tau$ to decide whether the current frame is a boundary, so its sensitivity adapts to diverse transitions. On two GEBD benchmarks (Kinetics-GEBD and TAPOS), ESTimator outperforms online baselines and is competitive with offline GEBD methods; ablations confirm complementary benefits of EST, REST, and OBD and demonstrate practical real-time performance. The approach aligns model decisions with human perceptual processes and offers taxonomy-free boundary detection in streaming content, with potential applications in long-form video analysis and real-time monitoring.

Abstract

Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect boundaries of generic events immediately in streaming videos. This task poses unique challenges: identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, ESTimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame that reflects current event dynamics, based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that ESTimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
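
To make the boundary decision concrete, the following is a minimal sketch of how an adaptive, statistics-based threshold on prediction errors could work. It is not the authors' implementation. The abstract does not specify the error metric, the memory size, or the exact test, so the cosine-distance error, the names `cosine_error` and `OnlineBoundaryDiscriminator`, and the parameters `tau`, `memory_size`, and `min_history` are all assumptions made for illustration. The sketch fits a Gaussian (mean and standard deviation) to the stored past errors and flags a boundary when the current error lies more than `tau` standard deviations above the mean.

```python
import math
from collections import deque
from statistics import mean, pstdev


def cosine_error(predicted, actual):
    """Assumed error metric: 1 - cosine similarity of two feature vectors."""
    dot = sum(p * a for p, a in zip(predicted, actual))
    norm = math.sqrt(sum(p * p for p in predicted)) * math.sqrt(
        sum(a * a for a in actual)
    )
    return 1.0 - dot / norm if norm > 0 else 1.0


class OnlineBoundaryDiscriminator:
    """Hypothetical sketch of an adaptive threshold over past prediction errors."""

    def __init__(self, tau=2.0, memory_size=16, min_history=4):
        self.tau = tau                           # z-score threshold (assumed form)
        self.errors = deque(maxlen=memory_size)  # bounded memory of past errors
        self.min_history = min_history           # errors needed before testing

    def step(self, error):
        """Return True if the current error is an outlier w.r.t. past errors."""
        is_boundary = False
        if len(self.errors) >= self.min_history:
            mu = mean(self.errors)
            sigma = pstdev(self.errors)
            # Guard against a zero standard deviation on constant histories.
            z = (error - mu) / max(sigma, 1e-8)
            is_boundary = z > self.tau
        self.errors.append(error)
        return is_boundary


if __name__ == "__main__":
    obd = OnlineBoundaryDiscriminator(tau=2.0)
    stream = [0.10, 0.11, 0.09, 0.10, 0.60, 0.10]  # made-up error values
    print([obd.step(e) for e in stream])
    # prints [False, False, False, False, True, False]
```

Because the threshold is expressed relative to the recent error statistics rather than as a fixed value, the same `tau` can flag a small jump in a stable segment and tolerate larger fluctuations in a noisy one, which is the behavior the abstract attributes to adaptive thresholding.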

Paper Structure

This paper contains 37 sections, 7 equations, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Comparison of offline-GEBD and human perception, with an illustration of Event Segmentation Theory (EST). (a) In the conventional GEBD task, all event boundaries are determined using all past and future frames. Humans, however, segment events sequentially, relying only on the visuals available at the current moment. (b) The illustration of EST shows how humans perceive events. When we perceive visuals, we naturally expect them to continue. When the incoming visual input differs significantly from that expectation, we perceive an event boundary.
  • Figure 2: Overview of our ESTimator framework. Our framework consists of three major components: the Consistent Event Anticipator (CEA), which generates a consistent future frame feature using a learnable token (Left); the EST-inspired training objective, which accumulates frame-level (EST loss) and region-level (REST loss) prediction errors derived from the discrepancy between the future frame generated by the CEA and the actual input frame (Upper right); and the Online Boundary Discriminator (OBD), which uses a queue of past prediction error values to conduct statistical testing on the error derived from the current input frame at inference (Lower right).
  • Figure 3: An illustration of how the Online Boundary Discriminator (OBD) applies a dynamic threshold to capture diverse event transitions.
  • Figure 4: Qualitative result. Comparison between our proposed framework and a baseline from another online video understanding task. Note that the baseline here refers to TeSTra with a binary classifier head attached for event boundary detection.
  • Figure 5: Additional qualitative result on the Kinetics-GEBD dataset. Comparison between our proposed framework and the baseline (TeSTra-BC).
  • ...and 1 more figures