Structured Context Learning for Generic Event Boundary Detection

Xin Gu, Congcong Li, Xinyao Wang, Dexiang Hong, Libo Zhang, Tiejian Luo, Longyin Wen, Heng Fan

TL;DR

The paper tackles Generic Event Boundary Detection (GEBD) by introducing Structured Context Learning (SCL) with Structured Partition of Sequence (SPoS) to provide localized, shared contextual information with linear-time complexity. SPoS partitions the video into K slices to generate structured context around each candidate frame, enabling flexible temporal models and reducing redundant computations, while group similarity maps are used with a lightweight FCN to predict boundaries. Gaussian smoothing of ground-truth boundaries addresses annotator disagreement, enhancing training stability. Experiments on Kinetics-GEBD and TAPOS show state-of-the-art accuracy and speed, with additional gains on shot-transition datasets highlighting strong generalization and practical impact.

Abstract

Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for addressing this task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models like GRU, LSTM, and Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, SPoS's overall computational complexity is linear with respect to the video length. We next calculate group similarities to capture differences between frames, and a lightweight fully convolutional network is utilized to determine the event boundaries based on the grouped similarity maps. To remedy the ambiguities of boundary annotations, we adapt the Gaussian kernel to preprocess the ground-truth event boundaries. Our proposed method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.
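The Gaussian preprocessing of ground-truth boundaries mentioned in the abstract can be illustrated with a minimal NumPy sketch. The function name `smooth_boundaries`, the `sigma` value, and the choice to combine nearby boundaries with a maximum are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def smooth_boundaries(boundaries, num_frames, sigma=1.0):
    """Turn hard boundary frame indices into soft per-frame labels in [0, 1].

    Each annotated boundary contributes a Gaussian bump centered on its frame.
    Overlapping bumps are merged with a maximum (an assumption, not taken
    from the paper), so every label stays within [0, 1].
    """
    b = np.asarray(boundaries, dtype=np.float64)
    if b.size == 0:
        # No annotated boundary: every frame is a negative.
        return np.zeros(num_frames, dtype=np.float64)
    t = np.arange(num_frames, dtype=np.float64)[:, None]   # shape (T, 1)
    bumps = np.exp(-((t - b[None, :]) ** 2) / (2.0 * sigma ** 2))  # shape (T, B)
    return bumps.max(axis=1)

# Usage: a single boundary annotated at frame 3 of a 7-frame clip.
labels = smooth_boundaries([3], num_frames=7, sigma=1.0)
print(np.round(labels, 3))
```

With soft targets of this kind, frames adjacent to an annotated boundary are penalized less than distant frames, which is one plausible way to absorb small disagreements between annotators.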

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Proposed Structured Context Learning method. A CNN backbone extracts 2D features that are pooled and converted to a sequence. The SPoS mechanism partitions the sequence, providing structured context $\mathbf{I}_t$. A temporal model (Transformer, LSTM, or GRU) learns high-level representations and enables feature sharing. Group similarities encode frame differences, and a lightweight FCN predicts event boundaries based on 2D grouped similarity maps.
  • Figure 2: Illustration of the proposed SPoS. The dark orange square denotes the candidate frame $I_t$, while the light orange squares denote its structured context. To obtain the $K$ adjacent frames $I_{\leftarrow t}$ before candidate frame $I_t$ and the $K$ frames $I_{t \rightarrow}$ after $I_t$, we split the input video sequence into $K$ slices. Each slice $S_k$ is responsible for producing the adjacent frames $I_{\leftarrow t}$ and $I_{t \rightarrow}$ for the frames at specific indices. Together, the $K$ slices cover all video frames and can be processed efficiently in parallel.
  • Figure 3: Visualization of grouped similarity maps $\mathbf{S}_t$, with $G=4$ in this example. The first row shows a local sequence that contains a potential boundary, while the second row shows a sequence with no boundary. The patterns also differ slightly from group to group for the same sequence, which may imply that each group learns a different aspect.
  • Figure 4: Example qualitative results on the Kinetics-GEBD validation split. Best viewed in color.
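
The slicing described in the Figure 2 caption can be sketched at the level of frame indices. This is a rough illustration rather than the authors' implementation: the function name `spos_indices`, the rule that slice $S_k$ handles candidates $k, k+K, k+2K, \dots$, and the clamping of out-of-range indices to the sequence edges are assumptions made here, since the caption does not state the index assignment or the padding scheme.

```python
import numpy as np

def spos_indices(num_frames, K):
    """Index-level sketch of a structured partition into K slices.

    Assumption: slice k handles candidate frames k, k+K, k+2K, ...
    For each candidate t, the left context is frames t-K .. t-1 and the
    right context is frames t+1 .. t+K. Indices that fall outside the
    sequence are clamped to the nearest valid frame (a hypothetical
    padding choice).
    """
    offsets = np.arange(1, K + 1)
    slices = []
    for k in range(K):
        cand = np.arange(k, num_frames, K)                # candidates of slice k
        left = cand[:, None] - offsets[::-1][None, :]     # t-K .. t-1
        right = cand[:, None] + offsets[None, :]          # t+1 .. t+K
        left = np.clip(left, 0, num_frames - 1)
        right = np.clip(right, 0, num_frames - 1)
        slices.append((cand, left, right))
    return slices

# Usage: 10 frames, K = 4 context frames on each side.
for k, (cand, left, right) in enumerate(spos_indices(10, 4)):
    print(f"slice {k}: candidates {cand.tolist()}")
```

Under these assumptions every frame appears as a candidate in exactly one slice, and the slices are independent of one another, so they could be batched and processed in parallel, in line with the caption's description.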