Rethinking the Architecture Design for Efficient Generic Event Boundary Detection

Ziwei Zheng, Zechuan Zhang, Yulin Wang, Shiji Song, Gao Huang, Le Yang

TL;DR

The paper tackles the efficiency gap in Generic Event Boundary Detection (GEBD) by rethinking the architecture step by step, from a minimal baseline to a highly efficient model family, EfficientGEBD. It shows that a concise baseline (BasicGEBD) already achieves strong results, and that image-domain backbones introduce a distraction issue that harms boundary detection, especially for event-level boundaries. By progressively modernizing the backbone, encoder, fusion, and decoder, and by adopting video-domain spatiotemporal modeling via Diff Mixer and Cross Attention, the authors achieve state-of-the-art performance with substantial speedups (up to 2.2x) on Kinetics-GEBD, and strong results on TAPOS and SoccerNet-v2. The work argues for prioritizing efficiency in GEBD design and highlights the practical benefits for long-form video processing, offering guidance toward future video-domain GEBD architectures.

Abstract

Generic event boundary detection (GEBD), inspired by the human visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the image-domain backbones widely applied in GEBD models can contain plenty of architectural redundancy, motivating us to gradually "modernize" each component to enhance efficiency. Thirdly, we show that GEBD models that use image-domain backbones and conduct spatiotemporal learning in a greedy spatial-then-temporal manner can suffer from a distraction issue, which might be the main source of inefficiency in GEBD. Using a video-domain backbone to conduct spatiotemporal modeling jointly is an effective solution to this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods by up to a 1.7% performance gain and a 280% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with model complexity in mind, particularly in resource-aware applications. The code is available at https://github.com/Ziwei-Zheng/EfficientGEBD.

Paper Structure (21 sections, 8 figures, 4 tables)

This paper contains 21 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Throughput vs. F1 score of GEBD methods on Kinetics-GEBD (Shou et al., 2021).
  • Figure 2: We modernize the proposed BasicGEBD towards the design of an efficient GEBD model. The colored bars are the F1@0.05 scores of the models and the gray bars depict the GFLOPs. A hatched bar means the modification is not adopted. The orange stars indicate FPS. In the end, our EfficientGEBD with ResNet50-L2* outperforms the previous SOTA method (SC-Transformer; Li et al., 2022), and can be improved further by using a CSN backbone (Tran et al., 2019).
  • Figure 3: The architecture of BasicGEBD.
  • Figure 4: GFLOPs vs. F1 score of BasicGEBD with different sizes of ResNets as the backbone.
  • Figure 5: The illustrations of the encoder (a) and the fusion module (d,e). In (b,c), we calculate the L2-norm and the cosine similarity map of the features at different timestamps to see whether the discriminative boundary features can be captured.
  • ...and 3 more figures
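
The F1@0.05 score referenced in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the common GEBD convention that a predicted boundary counts as correct when it lies within 0.05 times the video duration of a ground-truth boundary, with each ground-truth boundary matched at most once. The function name `f1_at_rel_dis`, its parameters, and the greedy matching are illustrative assumptions, not the authors' evaluation code; the official protocol may differ in details such as the handling of multiple annotators.

```python
from typing import Sequence


def f1_at_rel_dis(pred: Sequence[float], gt: Sequence[float],
                  duration: float, threshold: float = 0.05) -> float:
    """F1 score for boundary timestamps (in seconds) under a relative-distance tolerance.

    Hypothetical helper: a prediction is a hit if it falls within
    threshold * duration of a ground-truth boundary not yet matched.
    """
    if not pred or not gt:
        # Convention chosen for this sketch: no predictions or no labels gives 0.
        return 0.0

    tolerance = threshold * duration
    unmatched = sorted(gt)
    hits = 0
    for p in sorted(pred):
        if not unmatched:
            break
        # Greedily match against the nearest ground-truth boundary still available.
        nearest = min(unmatched, key=lambda g: abs(g - p))
        if abs(nearest - p) <= tolerance:
            hits += 1
            unmatched.remove(nearest)

    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gt)
    return 2 * precision * recall / (precision + recall)
```

For a 10-second video the tolerance is 0.5 s, so with predictions at 1.0, 4.2, and 8.0 s and ground-truth boundaries at 1.2, 4.0, and 6.0 s, two of three predictions match and precision, recall, and F1 all equal 2/3.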