Temporal Zoom Networks: Distance Regression and Continuous Depth for Efficient Action Localization
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
TL;DR
This work tackles temporal action localization (TAL) by addressing two problems: boundaries vary widely in difficulty, and uniform computation is inefficient. It introduces Boundary Distance Regression (BDR), a distance-based boundary formulation, and Adaptive Temporal Refinement (ATR), a differentiable, continuous-depth mechanism that concentrates computation near hard boundaries. The authors provide an information-theoretic analysis comparing boundary localization strategies, show empirical gains across multiple TAL datasets, and demonstrate substantial efficiency improvements with ATR (state-of-the-art mAP at reduced FLOPs), with distillation keeping training cost practical. The methods can be retrofitted to existing TAL models and are particularly effective for short actions, where precise boundaries matter most.
Abstract
Temporal action localization requires both precise boundary detection and computational efficiency. Current methods apply uniform computation across all temporal positions, wasting resources on easy boundaries while struggling with ambiguous ones. We address this through two complementary innovations: Boundary Distance Regression (BDR), which replaces classification-based boundary detection with signed-distance regression and achieves 3.3--16.7$\times$ lower variance; and Adaptive Temporal Refinement (ATR), which allocates transformer depth continuously ($\tau\in[0,1]$) to concentrate computation near difficult boundaries. On THUMOS14, our method achieves 56.5\% mAP@0.7 and 58.2\% average mAP@[0.3:0.7] with 151G FLOPs, 36\% fewer than ActionFormer++ (55.7\% mAP@0.7 at 235G). Compared to uniform baselines, we achieve +2.9\% mAP@0.7 (5.4\% relative) and +1.8\% average mAP with 24\% fewer FLOPs and 29\% lower latency, with particularly strong gains on short actions (+4.2\%, 8.6\% relative). Training requires 1.29$\times$ baseline FLOPs, but this one-time cost is amortized over many inference runs; knowledge distillation further reduces it to 1.1$\times$ while retaining 99.5\% accuracy. Our contributions include: (i) a theoretically grounded distance formulation with information-theoretic analysis showing optimal variance scaling; (ii) a continuous depth allocation mechanism that avoids the complexity of discrete routing; and (iii) consistent improvements across four datasets, with gains correlating with boundary heterogeneity.
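
To make the signed-distance idea behind BDR concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the regression target at each frame is the signed offset to the nearest ground-truth boundary (negative before the boundary, positive after), and that a boundary is read off where the predicted distance crosses zero from below. The function names `signed_distance_targets` and `zero_crossings` are hypothetical, chosen only for illustration.

```python
def signed_distance_targets(num_frames, boundaries):
    """Assumed BDR-style target: for each frame t, the signed offset
    t - b to the nearest boundary b (negative before b, positive after)."""
    targets = []
    for t in range(num_frames):
        nearest = min(boundaries, key=lambda b: abs(t - b))
        targets.append(float(t - nearest))
    return targets


def zero_crossings(distances):
    """Estimate sub-frame boundary positions by linear interpolation
    wherever the distance goes from negative to non-negative."""
    crossings = []
    for t in range(len(distances) - 1):
        d0, d1 = distances[t], distances[t + 1]
        if d0 < 0 <= d1:
            # Fraction of the way from frame t to t + 1 at which d = 0.
            crossings.append(t + (-d0) / (d1 - d0))
    return crossings


# Hypothetical example: 12 frames, boundaries at 3.5 and 9.25 (frame units).
targets = signed_distance_targets(12, [3.5, 9.25])
recovered = zero_crossings(targets)
```

With exact targets, the zero crossings land on the original boundary positions, including the sub-frame offsets. A classification-style detector would instead pick the frame with the highest boundary score, which limits it to frame resolution in this toy setting. Note that this sketch does not detect a boundary that falls exactly on frame 0, since it needs a negative value before the crossing.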
