Temporal Action Detection Model Compression by Progressive Block Drop

Xiaoyong Chen, Yong Guo, Jiaming Liang, Sitong Zhuang, Runhao Zeng, Xiping Hu

TL;DR

This work tackles the high computational cost of temporal action detection (TAD) with a depth-focused model compression method, Progressive Block Drop. The method iteratively drops blocks selected by a Block Selection Evaluator and recovers performance with Cross-Depth Alignment, which uses LoRA-based fine-tuning. It reduces MACs by about $25\%$ on THUMOS14 and ActivityNet-1.3 while maintaining or even improving accuracy, and it remains complementary to channel pruning. The underlying observation is that GPUs execute a few large matrix multiplications more efficiently than many small ones, so removing blocks while keeping layers wide tends to be more hardware-friendly than thinning channels. The paper reports consistent gains across multiple TAD architectures and datasets, including FineActions and AdaTAD, as well as on a natural language localization task. In practice, this enables faster, more resource-efficient TAD deployment in settings such as robotics and autonomous systems, with compatibility with existing pruning techniques and potential for further acceleration with sparse activations.

Abstract

Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders parallelization efficiency on GPUs, because multiplying small matrices is inefficient. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop the blocks with minimal impact on model performance; second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore its accuracy. Our method achieves lossless compression with a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3). More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
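
The two-step loop described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the blocks are scalar residual functions, `score_fn` is a hypothetical stand-in for the block selection step (it scores a candidate by how closely its output matches the uncompressed model on a few calibration inputs), and `recover_fn` is a placeholder where the parameter-efficient cross-depth alignment fine-tuning would go.

```python
from typing import Callable, List, Tuple

# A toy "block" is a (name, function) pair; real blocks would be network layers.
Block = Tuple[str, Callable[[float], float]]


def forward(blocks: List[Block], x: float) -> float:
    """Run the input through every remaining block in order."""
    for _, fn in blocks:
        x = fn(x)
    return x


def progressive_block_drop(blocks, score_fn, recover_fn, num_drops):
    """Drop one block per iteration, then run a recovery step."""
    kept = list(blocks)
    for _ in range(num_drops):
        # Step 1: try removing each block and keep the best-scoring candidate.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        best = max(candidates, key=score_fn)
        # Step 2: recover performance before the next drop.
        kept = recover_fn(best)
    return kept


def make_residual(delta: float) -> Callable[[float], float]:
    """Toy residual block: x -> x + delta (small delta = nearly redundant)."""
    return lambda x: x + delta


# Assumed toy model: blocks 1 and 3 barely change their input.
full = [(f"block{i}", make_residual(d))
        for i, d in enumerate([0.5, 0.001, 0.3, 0.002])]
calibration_inputs = [0.0, 1.0, 2.0]


def score_fn(candidate: List[Block]) -> float:
    """Hypothetical evaluator: negative squared deviation from the full model."""
    return -sum((forward(candidate, x) - forward(full, x)) ** 2
                for x in calibration_inputs)


def recover_fn(candidate: List[Block]) -> List[Block]:
    """Placeholder for fine-tuning; returns the candidate unchanged."""
    return candidate


compressed = progressive_block_drop(full, score_fn, recover_fn, num_drops=2)
kept_names = [name for name, _ in compressed]
print(kept_names)
```

Dropping one block at a time and re-scoring after each recovery step means the second choice is made against the already-compressed model rather than the original one, which is the point of doing the compression progressively.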

Paper Structure

This paper contains 29 sections, 7 equations, 6 figures, 14 tables, 1 algorithm.

Figures (6)

  • Figure 1: Comparisons between reducing depth and reducing width. For a 768-frame video sequence with a resolution of $160^2$, using VideoMAE-S as the feature extractor and ActionFormer as the detection head, 95% of the computational overhead comes from the feature extractor. Most pruning methods reduce the size of layer weight matrices, leading to a slim-and-tall network structure. In contrast, our progressive block drop method reduces network depth, achieving 1.19× faster inference at the same computational cost. These findings, detailed in the Inference Time section, suggest that our approach results in a more hardware-efficient model.
  • Figure 2: Block-level analysis of TAD models on the THUMOS14 dataset, using VideoMAE-S. Blue curve: impact of dropping a single block on the model's detection accuracy; in most cases the accuracy degradation is minimal. Green line: MSE between the input and output features of each block. Some blocks exhibit MSE values close to 0, indicating that these blocks are redundant.
  • Figure 3: Diagram of our progressive block drop method. Our approach adopts a multi-step progressive compression strategy. At each iteration, 1) we evaluate the importance of each block and drop the least important one, and 2) we apply parameter-efficient tuning to recover performance by learning from the uncompressed model through feature-level and prediction-level alignment.
  • Figure 4: Comparison and compatibility with the pruning method. Bubble size represents the model's computational complexity. Combining our method with pruning enables a further 1.3$\times$ acceleration, demonstrating that the two are compatible.
  • Figure A: Quantitative analysis between uncompressed and pruned model. The pruned model enhances localization performance while diminishing classification performance.
  • ...and 1 more figure
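
The per-block input/output MSE described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes toy blocks that act on plain Python lists; `make_block` and its `scale` parameter are hypothetical and only serve to produce one block that leaves its input unchanged. It is not the paper's measurement code.

```python
from typing import Callable, List

Features = List[float]


def mse(a: Features, b: Features) -> float:
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def block_io_mse(blocks: List[Callable[[Features], Features]],
                 features: Features) -> List[float]:
    """Return, for each block, the MSE between its input and its output."""
    scores = []
    x = list(features)
    for block in blocks:
        y = block(x)
        scores.append(mse(x, y))
        x = y  # the output of one block is the input of the next
    return scores


def make_block(scale: float) -> Callable[[Features], Features]:
    """Toy residual block: x -> x + scale * x (scale 0 means no change)."""
    return lambda v: [e + scale * e for e in v]


# Assumed toy model in which the middle block is an identity mapping.
blocks = [make_block(0.5), make_block(0.0), make_block(0.25)]
scores = block_io_mse(blocks, [1.0, 2.0])
most_redundant = min(range(len(scores)), key=scores.__getitem__)
print(scores, most_redundant)
```

A block whose score is near zero changes its input very little, which is the sense in which the caption calls such blocks redundant.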