Fine-grained Dynamic Network for Generic Event Boundary Detection

Ziwei Zheng, Lijun He, Le Yang, Fan Li

TL;DR

This work tackles generic event boundary detection (GEBD) with DyBDet, a dynamic network that allocates subnet processing to video snippets according to their boundary characteristics. It combines a multi-exit backbone with a multi-order difference detector (MDE) and a pairwise contrast module (PCM) to capture both simple, low-level changes and complex, high-level dynamics, using local windows and soft-label training for robustness. On Kinetics-GEBD and TAPOS, it reports state-of-the-art performance across Rel.Dis thresholds while reducing computational cost through adaptive inference and partial exits. The approach also shows strong generalization, interpretability via pairwise similarity maps, and potential applicability to broader temporal localization tasks. Overall, DyBDet enables fine-grained, efficient boundary detection that adapts to the inherent diversity of event boundaries in long-form video.

Abstract

Generic event boundary detection (GEBD) aims to pinpoint the event boundaries that humans naturally perceive, and it plays a crucial role in understanding long-form videos. The task remains challenging because generic boundaries are diverse, spanning changes in video appearance, objects, and actions. Existing methods usually detect all boundaries with the same protocol, regardless of their distinctive characteristics and detection difficulty, which leads to suboptimal performance. Intuitively, a more reasonable approach is to detect boundaries adaptively according to their specific properties. In light of this, we propose DyBDet, a novel dynamic pipeline for generic event boundary detection. By introducing a multi-exit network architecture, DyBDet automatically learns to allocate subnets to different video snippets, enabling fine-grained detection of various boundaries. In addition, a multi-order difference detector is proposed to ensure that generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that the dynamic strategy significantly benefits GEBD, yielding clear improvements in both performance and efficiency over the current state of the art.
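
To make the multi-exit idea concrete, here is a minimal sketch of adaptive inference for a single snippet. It is not the DyBDet implementation: the `Stage` and `ExitHead` types, the function `adaptive_inference`, and the confidence-based exit rule are all assumptions made for illustration, since the abstract does not specify how exits are chosen.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a stage refines features, an exit head maps
# features to a boundary score in [0, 1].
Stage = Callable[[List[float]], List[float]]
ExitHead = Callable[[List[float]], float]


def adaptive_inference(
    snippet: Sequence[float],
    stages: Sequence[Stage],
    exit_heads: Sequence[ExitHead],
    confidence: float = 0.9,
) -> Tuple[float, int]:
    """Run stages in order and stop at the first confident exit.

    Returns (boundary_score, index_of_exit_used). The confidence rule
    below is an assumed criterion, not the one used in the paper.
    """
    if len(stages) != len(exit_heads) or not stages:
        raise ValueError("need at least one stage and one exit head per stage")
    features = list(snippet)
    score = 0.0
    for idx, (stage, head) in enumerate(zip(stages, exit_heads)):
        features = stage(features)
        score = head(features)
        # Stop when the exit is confident either way:
        # clearly a boundary or clearly not a boundary.
        if max(score, 1.0 - score) >= confidence:
            return score, idx
    # No exit was confident: use the deepest exit's score.
    return score, len(stages) - 1
```

Under this rule, snippets whose first exit is already decisive skip the deeper stages, which is where the compute saving of a multi-exit design would come from; ambiguous snippets run through every stage.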

Paper Structure

This paper contains 38 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Adaptive inference for video snippets within the dynamic architecture. We use temporal differences of different orders to capture the distinctive boundary features and plot the normalized activations for better visualization. The ground-truth boundaries are highlighted with red lines. (a): The shot-change boundary is clearly identifiable from low-level appearance features without temporal differences, so the snippet can exit early to save computation. (b): The action-change boundary relies on high-level semantic features and high-order temporal differences to be revealed.
  • Figure 2: Overview of the proposed DyBDet. Boundaries are highlighted with red lines. (a): The multi-exit network, which enables frame-level adaptive inference. (b): The multi-order difference detector, which distinguishes boundaries with various characteristics.
  • Figure 3: Comparison of F1@0.05 vs. FLOPs against other methods on Kinetics-GEBD. We report the average FLOPs per frame over the whole inference pipeline.
  • Figure 4: Dynamic networks with partial exit built from different detectors. MDE+PCM indicates the detector in DyBDet.
  • Figure 5: The predictions of different detectors w/o partial exit. The red line represents the threshold $\epsilon$, and the gray area indicates the ground-truth labels w.r.t. F1@0.05. Stars are detected boundaries.
  • ...and 2 more figures
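
The "temporal differences of different orders" mentioned in the Figure 1 caption can be illustrated with plain finite differences over a sequence of per-frame feature vectors. This is only a sketch under assumptions: the paper's detector operates on learned features and its exact formulation is not given in this excerpt, and the names `temporal_difference` and `difference_energy` are hypothetical.

```python
from typing import List, Sequence


def temporal_difference(
    features: Sequence[Sequence[float]], order: int
) -> List[List[float]]:
    """Return the order-th finite difference along the time axis.

    Order 0 returns a copy of the input; each extra order shortens
    the sequence by one step.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    current = [list(frame) for frame in features]
    for _ in range(order):
        current = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(current, current[1:])
        ]
    return current


def difference_energy(
    features: Sequence[Sequence[float]], order: int
) -> List[float]:
    """Per-step L1 magnitude of the order-th temporal difference."""
    return [sum(abs(v) for v in frame)
            for frame in temporal_difference(features, order)]


# An abrupt change (like a shot cut) gives one sharp first-order peak.
abrupt = [[0.0], [0.0], [5.0], [5.0]]
print(difference_energy(abrupt, 1))  # [0.0, 5.0, 0.0]
```

In this toy setting an abrupt change is already visible at order 1, while smoother changes spread their first-order response over several steps and may need higher orders to stand out.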