Temporal Action Detection Model Compression by Progressive Block Drop
Xiaoyong Chen, Yong Guo, Jiaming Liang, Sitong Zhuang, Runhao Zeng, Xiping Hu
TL;DR
This work tackles the high computational cost of temporal action detection (TAD) by introducing a depth-focused model compression method, Progressive Block Drop. The approach iteratively drops blocks selected by a Block Selection Evaluator and recovers performance with Cross-Depth Alignment using LoRA-based fine-tuning. It reduces MACs by about $25\%$ on THUMOS14 and ActivityNet-1.3 while maintaining or even improving accuracy, and it remains complementary to channel pruning. The method builds on the observation that GPUs run a few large matrix multiplications more efficiently than many small ones, so reducing depth while keeping layer width preserves parallel efficiency better than narrowing layers. It demonstrates robust improvements across multiple TAD architectures and datasets, including FineActions and AdaTAD, as well as on a natural language localization task. The practical impact lies in enabling faster, more resource-efficient TAD deployment in real-world settings such as robotics and autonomous systems, with broad compatibility with existing pruning techniques and potential for further acceleration with sparse activations. Overall, the paper provides a concrete, generalizable strategy for depth-wise compression that preserves accuracy and improves inference speed in temporal action detection.
Abstract
Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs, because multiplications between small matrices are inefficient. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop the blocks with minimal impact on model performance; second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore its accuracy. Our method reduces computational overhead by 25% on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) without loss of accuracy, achieving lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
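The two-step loop described in the abstract can be illustrated with a minimal sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the names `ScaleBlock`, `forward`, `progressive_block_drop`, `score_fn`, and `recover_fn` are hypothetical, the blocks are scalar residual functions rather than network layers, and the recovery step is a no-op placeholder standing in for the cross-depth alignment fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ScaleBlock:
    """Hypothetical toy residual block: contributes weight * x to the stream."""
    weight: float

    def __call__(self, x: float) -> float:
        return self.weight * x


def forward(blocks: Sequence[ScaleBlock], x: float) -> float:
    """Apply the blocks in order with residual connections: x <- x + f(x)."""
    for block in blocks:
        x = x + block(x)
    return x


def progressive_block_drop(
    blocks: Sequence[ScaleBlock],
    score_fn: Callable[[Sequence[ScaleBlock]], float],
    recover_fn: Callable[[List[ScaleBlock]], List[ScaleBlock]],
    num_drops: int,
) -> List[ScaleBlock]:
    """Greedily drop one block per round, then run a recovery step."""
    kept = list(blocks)
    for _ in range(num_drops):
        # Step 1: try removing each remaining block and keep the candidate
        # whose score is highest, i.e. the drop that hurts the least.
        candidates = [kept[:i] + kept[i + 1:] for i in range(len(kept))]
        best = max(candidates, key=score_fn)
        # Step 2: recover performance of the shallower model
        # (assumption: stands in for parameter-efficient fine-tuning).
        kept = recover_fn(best)
    return kept


# Toy setup: two blocks contribute almost nothing to the output.
original = [ScaleBlock(0.5), ScaleBlock(0.001), ScaleBlock(0.3), ScaleBlock(0.002)]
reference = forward(original, 1.0)


def score_fn(candidate: Sequence[ScaleBlock]) -> float:
    # Higher is better: negative deviation from the full model's output.
    return -abs(forward(candidate, 1.0) - reference)


def recover_fn(candidate: List[ScaleBlock]) -> List[ScaleBlock]:
    # Placeholder: a real pipeline would fine-tune the pruned model here.
    return candidate


kept = progressive_block_drop(original, score_fn, recover_fn, num_drops=2)
print([block.weight for block in kept])
```

In this toy, the score is the deviation from the full model's output on a single input; a real evaluator would measure detection quality on held-out data. Every surviving block keeps its original width, so only the number of sequential operations shrinks.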
