Online Generic Event Boundary Detection
Hyungrok Jung, Daneul Kim, Seunggyun Lim, Jeany Son, Jonghyun Choi
TL;DR
This work introduces Online Generic Event Boundary Detection (On-GEBD), a streaming task that requires predicting event boundaries in real time using only past and present frames. It proposes ESTimator, a framework inspired by Event Segmentation Theory that consists of two components: the Consistent Event Anticipator (CEA), which anticipates ongoing event dynamics, and the Online Boundary Discriminator (OBD), which detects boundaries via adaptive error thresholds. CEA uses a transformer decoder to predict the future frame feature $\hat{\mathbf{f}}_t$ from past frames and is trained with an EST loss and a REST loss to reinforce consistent predictions. OBD maintains a memory of past prediction errors and applies a Gaussian-based statistical test with threshold $\tau$ to decide whether the current frame is a boundary, so its sensitivity adapts to diverse transitions. On two GEBD benchmarks (Kinetics-GEBD and TAPOS), ESTimator outperforms online baselines and is competitive with offline GEBD methods. Ablations confirm the complementary benefits of EST, REST, and OBD, and the method runs at practical real-time speed. The approach advances online video understanding by aligning model decisions with human perceptual processes and offering robust, taxonomy-free boundary detection in streaming content, with potential applications in long-form video analysis and real-time monitoring.
Abstract
Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process information online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect boundaries of generic events immediately in streaming videos. This task poses the unique challenge of identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, ESTimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame that reflects current event dynamics based solely on prior frames. The OBD then measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that ESTimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline GEBD methods on the Kinetics-GEBD and TAPOS datasets.
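To make the adaptive-threshold idea concrete, the following is a minimal sketch of a boundary test on a stream of prediction errors. It is not the authors' OBD implementation: the class name `ErrorThresholdDetector`, the use of a z-score against a Gaussian fit of the stored errors, the interpretation of `tau` as a z-score cutoff, the `memory_size` and `min_history` parameters, and the choice to clear the memory after a boundary are all assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class ErrorThresholdDetector:
    """Illustrative adaptive-threshold boundary test (hypothetical, not the paper's OBD)."""

    def __init__(self, tau=3.0, memory_size=32, min_history=4, eps=1e-8):
        self.tau = tau                           # assumed: z-score cutoff
        self.memory = deque(maxlen=memory_size)  # recent prediction errors
        self.min_history = min_history           # errors needed before testing
        self.eps = eps                           # avoids division by zero

    def update(self, error):
        """Consume one prediction error; return True if it looks like a boundary."""
        if len(self.memory) < self.min_history:
            self.memory.append(error)
            return False
        # Fit a Gaussian to the stored errors and score the new error against it.
        mu = mean(self.memory)
        sigma = pstdev(self.memory)
        z = (error - mu) / (sigma + self.eps)
        is_boundary = z > self.tau
        if is_boundary:
            # Assumption: a new event starts, so old error statistics are dropped.
            self.memory.clear()
        else:
            self.memory.append(error)
        return is_boundary


def detect_boundaries(errors, **kwargs):
    """Run the detector over a sequence of errors and return boundary indices."""
    detector = ErrorThresholdDetector(**kwargs)
    return [t for t, e in enumerate(errors) if detector.update(e)]


errors = [0.10, 0.11, 0.09, 0.10, 0.11, 0.90, 0.50, 0.51, 0.49, 0.50, 0.50]
print(detect_boundaries(errors, tau=3.0))  # [5]
```

In this toy stream the error jumps at index 5, which is the only frame flagged. Because the threshold is relative to the mean and spread of recent errors rather than a fixed value, the same `tau` would react to a small jump in a quiet segment and tolerate larger fluctuations in a noisy one.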
