Temporal Zoom Networks: Distance Regression and Continuous Depth for Efficient Action Localization

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR

This work tackles temporal action localization by addressing the uneven difficulty of boundary prediction and the inefficiency of uniform computation. It introduces Boundary Distance Regression (BDR), a distance-based boundary formulation, and Adaptive Temporal Refinement (ATR), a differentiable, continuous-depth mechanism that concentrates computation near hard boundaries. The authors provide information-theoretic analysis comparing boundary localization strategies, show empirical gains across multiple TAL datasets, and demonstrate substantial efficiency improvements with ATR (state-of-the-art mAP at reduced FLOPs) while enabling practical training via distillation. The methods are broadly applicable, retrofittable to existing TAL models, and particularly effective for short actions where precise boundaries matter most, signaling a meaningful advance in efficient, high-precision video understanding.

Abstract

Temporal action localization requires both precise boundary detection and computational efficiency. Current methods apply uniform computation across all temporal positions, wasting resources on easy boundaries while struggling with ambiguous ones. We address this through two complementary innovations: Boundary Distance Regression (BDR), which replaces classification-based boundary detection with signed-distance regression, achieving 3.3 to 16.7$\times$ lower variance; and Adaptive Temporal Refinement (ATR), which allocates transformer depth continuously ($\tau\in[0,1]$) to concentrate computation near difficult boundaries. On THUMOS14, our method achieves 56.5% mAP@0.7 and 58.2% average mAP@[0.3:0.7] with 151G FLOPs, using 36% fewer FLOPs than ActionFormer++ (55.7% mAP@0.7 at 235G). Compared to uniform baselines, we achieve +2.9% mAP@0.7 (+1.8% avg mAP, 5.4% relative) with 24% fewer FLOPs and 29% lower latency, with particularly strong gains on short actions (+4.2%, 8.6% relative). Training requires 1.29$\times$ baseline FLOPs, but this one-time cost is amortized over many inference runs; knowledge distillation further reduces this to 1.1$\times$ while retaining 99.5% accuracy. Our contributions include: (i) a theoretically grounded distance formulation with information-theoretic analysis showing optimal variance scaling; (ii) a continuous depth allocation mechanism avoiding discrete routing complexity; and (iii) consistent improvements across four datasets with gains correlating with boundary heterogeneity.
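To make the signed-distance idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the signed distance takes the form $d(t) = t - b$ for a single boundary $b$ and reads the boundary off as the zero-crossing of a predicted distance curve. The function names, the Gaussian noise standing in for regression error, and the boundary value 25.4 are all hypothetical.

```python
import numpy as np

def signed_distance(t, boundary):
    """Signed distance d(t) = t - boundary: negative before the boundary, positive after."""
    return np.asarray(t, dtype=float) - boundary

def zero_crossing(t, d):
    """Return the first zero-crossing of d, linearly interpolated between samples."""
    for i in range(len(d) - 1):
        if d[i] == 0.0:
            return float(t[i])
        if d[i] * d[i + 1] < 0.0:
            # Fraction of the way from t[i] to t[i + 1] at which d reaches zero.
            frac = d[i] / (d[i] - d[i + 1])
            return float(t[i] + frac * (t[i + 1] - t[i]))
    return float(t[-1]) if d[-1] == 0.0 else None

t = np.arange(0, 50)                      # frame indices
d_true = signed_distance(t, 25.4)         # hypothetical boundary between frames 25 and 26

# Hypothetical regression noise standing in for model error (an assumption).
rng = np.random.default_rng(0)
d_pred = d_true + rng.normal(scale=0.05, size=t.shape)

boundary_hat = zero_crossing(t, d_pred)   # sub-frame boundary estimate
print(boundary_hat)
```

Because the distance curve has unit slope everywhere, a small error in the predicted distance translates into a comparably small error in the recovered boundary, which is the intuition behind regressing distances rather than classifying boundary frames.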

Paper Structure

This paper contains 66 sections, 4 theorems, 31 equations, 7 figures, 23 tables, 1 algorithm.

Key Result

Theorem 1

Let features near the true boundary $b^\ast$ follow a Gaussian similarity kernel $\mathbf{h}(t)=\phi(t)\,v$ with $\phi(t)=\exp(-{(t-b^\ast)^2}/{(2\kappa^2)})$ and $v\in\mathbb{R}^D$, $\|v\|_2=1$. Let a calibrated classifier be $p(t)=\sigma(w^\top \mathbf{h}(t))$ with $\|w\|_2=1$. Under regularity co

Figures (7)

  • Figure 1: Adaptive Temporal Refinement (ATR) architecture. Four stages: (1) a shallow transformer produces coarse predictions and uncertainty; (2) an MLP predicts continuous depth allocation $\tau_t$; (3) a deep transformer refines difficult regions; (4) residual refinement merges predictions. Boundaries are extracted via signed-distance regression, and token pruning reduces computation in low-information regions.
  • Figure 2: BDR vs Classification comparison. BDR produces sharp zero-crossings at boundaries (blue line: distance to start boundary at t=25, showing $d(t) = t - 25$ with zero-crossing only at the true boundary) while classification creates fuzzy probability regions (red). The signed distance field $d(t) = t - b(t)$ has constant gradient $|\nabla_t d| = 1$ and clear zero-crossings only at true boundaries, enabling precise localization. End boundaries are detected similarly using distance to the end boundary.
  • Figure 3: Pareto on THUMOS14. ATR dominates uniform baselines across budgets.
  • Figure 4: Feature smoothness $\kappa$ distribution across 1,220 THUMOS14 boundaries. Range: 0.8 to 6.2 frames (median 3.1), validating heterogeneous difficulty. Sharp ($\kappa<2$): 32%, medium ($2\leq\kappa\leq4$): 40%, gradual ($\kappa>4$): 28%.
  • Figure 5: Temporal correlation robustness. Variance ratio $R$ remains stable (variation $<15\%$) for $\rho<0.6$, degrading at high correlation. Real video features have $\rho\approx0.4$, validating theoretical predictions under moderate dependencies.
  • ...and 2 more figures
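
The four stages named in the Figure 1 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's architecture: the shallow and deep transformers and the depth MLP are replaced by simple stand-in functions, the blending rule `coarse + tau * (refined - coarse)` is one plausible reading of "residual refinement", and the pruning threshold `prune_below` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 8  # hypothetical sizes: 16 temporal positions, 8-dim features

def shallow_stage(x):
    """Stand-in for the shallow transformer: coarse features and per-position uncertainty."""
    coarse = np.tanh(x)
    uncertainty = np.abs(x).mean(axis=1)  # hypothetical uncertainty proxy
    return coarse, uncertainty

def depth_allocation(uncertainty):
    """Stand-in for the depth MLP: squash uncertainty into tau_t in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-4.0 * (uncertainty - uncertainty.mean())))

def deep_stage(h):
    """Stand-in for the deep transformer (any position-wise refinement)."""
    return h + 0.1 * np.sin(h)

def atr_forward(x, prune_below=0.1):
    coarse, uncertainty = shallow_stage(x)          # stage 1
    tau = depth_allocation(uncertainty)             # stage 2
    keep = tau >= prune_below                       # token pruning (assumed rule)
    refined = coarse.copy()
    refined[keep] = deep_stage(coarse[keep])        # stage 3: refine kept positions only
    out = coarse + tau[:, None] * (refined - coarse)  # stage 4: assumed residual merge
    return out, tau, keep

x = rng.normal(size=(T, D))
out, tau, keep = atr_forward(x)
print(out.shape, float(tau.min()), float(tau.max()), int(keep.sum()))
```

With this blending rule, positions with $\tau_t$ near 0 keep the coarse prediction, positions with $\tau_t$ near 1 take the fully refined one, and the whole path stays differentiable in $\tau_t$, which is one way to avoid discrete routing decisions.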

Theorems & Definitions (4)

  • Theorem 1: Classification variance bound
  • Theorem 2: BDR Fisher information
  • Corollary 1: Naive Fisher bound with action-length averaging
  • Lemma 1: Finite-sample variance with approximation error
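
The setup stated under Key Result (Gaussian similarity kernel $\phi(t)$, features $\mathbf{h}(t)=\phi(t)\,v$, classifier $p(t)=\sigma(w^\top \mathbf{h}(t))$) can be evaluated numerically to see how the classifier response flattens as $\kappa$ grows. The sketch below does not reproduce the variance bound itself, whose statement is truncated in this listing. It assumes $w = v$, uses $b^\ast = 25$ as in the Figure 2 example, and takes the $\kappa$ values 0.8, 3.1 and 6.2 from the range and median reported in Figure 4. The `plateau_width` diagnostic is a hypothetical measure introduced only for illustration.

```python
import numpy as np

def classifier_response(t, b_star, kappa, v, w):
    """p(t) = sigmoid(w^T h(t)) with h(t) = phi(t) * v and a Gaussian kernel phi."""
    phi = np.exp(-((t - b_star) ** 2) / (2.0 * kappa ** 2))
    h = phi[:, None] * v[None, :]          # shape (len(t), D)
    logits = h @ w
    return 1.0 / (1.0 + np.exp(-logits))

def plateau_width(t, p, rel=0.99):
    """Hypothetical diagnostic: length of the region where p is within 1% of its peak."""
    dt = t[1] - t[0]
    return float(np.count_nonzero(p >= rel * p.max()) * dt)

D = 4
v = np.ones(D) / np.sqrt(D)   # unit-norm feature direction
w = v.copy()                  # assumption: classifier aligned with v, so w^T v = 1

t = np.linspace(0.0, 50.0, 501)
b_star = 25.0

widths = {
    kappa: plateau_width(t, classifier_response(t, b_star, kappa, v, w))
    for kappa in (0.8, 3.1, 6.2)
}
print(widths)
```

The plateau widens in proportion to $\kappa$: the smoother the features around a boundary, the larger the set of frames with nearly identical classifier probability, which matches the "fuzzy probability regions" described for classification in the Figure 2 caption.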