LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen

TL;DR

LoSA addresses the challenge of scaling end-to-end Temporal Action Localization (TAL) to large video foundation models by introducing memory- and parameter-efficient backbone adapters. It attaches Long-range and Short-range Adapters to intermediate backbone layers and uses a Long-Short-range Gated Fusion module to produce TAL-enhanced features without backpropagating through the backbone, which enables end-to-end training on billion-parameter models such as VideoMAEv2 (ViT-g). Empirically, LoSA achieves state-of-the-art results on THUMOS-14 and ActivityNet-v1.3, outperforming head-only and existing Parameter-Efficient Transfer Learning (PETL) approaches while significantly reducing the memory footprint: on ViT-g, full-backbone adaptation and PETL run out of GPU memory, whereas LoSA trains end-to-end. The work shows that efficient adapters designed for the temporal context of untrimmed video can make large video foundation models usable for precise action localization, with possible extensions to spatio-temporal localization and multi-modal tasks.

Abstract

Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Gated Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
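
To make "localizing and classifying action snippets" concrete, the following minimal sketch shows how a predicted snippet is typically scored against a ground-truth snippet using temporal Intersection over Union (tIoU), the overlap measure behind the mAP numbers reported on THUMOS-14 and ActivityNet-v1.3. The function names, the `(start, end, label)` tuple layout, and the example segments are illustrative assumptions, not code from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) intervals in seconds, with start <= end."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction counts as correct when its class label matches the
    ground truth and its tIoU reaches the threshold.

    pred, gt: hypothetical (start, end, label) tuples.
    """
    same_label = pred[2] == gt[2]
    return same_label and temporal_iou(pred[:2], gt[:2]) >= threshold


# Hypothetical ground truth and predictions for one untrimmed video.
gt_snippet = (10.0, 20.0, "LongJump")
close_pred = (12.0, 22.0, "LongJump")   # overlaps 8 s out of a 12 s union
loose_pred = (18.0, 30.0, "LongJump")   # overlaps 2 s out of a 20 s union
wrong_class = (12.0, 22.0, "HighJump")  # good boundaries, wrong label

iou_close = temporal_iou(close_pred[:2], gt_snippet[:2])
iou_loose = temporal_iou(loose_pred[:2], gt_snippet[:2])
print(round(iou_close, 3), round(iou_loose, 3))
```

At a threshold of 0.5, only `close_pred` would count as a correct detection: `loose_pred` fails on boundary overlap and `wrong_class` fails on the label, which is why both tighter boundaries and better classification raise mAP.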

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: TAL Training Strategies/Performance. (a) Head-only Transfer Learning: Untrimmed video frames are processed as an independent set of clips by the frozen backbone, and the features are concatenated after the last layer and fed to a learnable TAL head. (b) Full-backbone Transfer Learning: Untrimmed video frames are processed as an independent set of clips by a learnable backbone, and the features are concatenated after the last layer and fed to a learnable TAL head. (c) Parameter-Efficient Transfer Learning (PETL): Untrimmed video frames are processed as an independent set of clips by a frozen backbone fitted with learnable adapter modules, and the features are concatenated after the last layer and fed to a learnable TAL head. Gradients backpropagate through the entire backbone, making PETL adapters parameter-efficient but not memory-efficient. There is no untrimmed temporal learning in the intermediate layers in (a-c). (d) LoSA (Ours): Untrimmed video frames are processed jointly at each intermediate layer, enabling untrimmed temporal learning by long- and short-range adapters (green) to obtain TAL-enhanced features, which are fed to a learnable TAL head. No gradients backpropagate through the backbone, making LoSA both memory- and parameter-efficient. (e) On VideoMAEv2 (ViT-g) with THUMOS-14, only LoSA (d) can perform end-to-end TAL, while full-backbone (b) and PETL (c) lead to a GPU out-of-memory error; LoSA thereby significantly outperforms head-only (a).
  • Figure 2: LoSA Overview: LoSA comprises a series of Long-range and Short-range Adapters that attach to the intermediate layers $1, \hdots, N-1$ of a video backbone. (a) Each Short-range Adapter consists of a cross-attention module that uses the video clip-level spatio-temporal features of an intermediate layer as Query and the last layer temporally concatenated features as Key and Value. (b) Similarly, each Long-range Adapter uses a cross-attention module to cross-attend the temporally concatenated long-range untrimmed video features of an intermediate layer as Query (Q) and the last layer temporally concatenated features as Key (K) and Value (V). (c) Finally, the Long-Short-range Gated Fusion module learns scaling parameters to gate the contribution of the Long-range and Short-range Adapters and combines them with the temporally concatenated last layer features via a projection layer to generate the TAL-enhanced features going into the learnable TAL head for outputting the localized action snippets.
  • Figure 3: Sensitivity analysis on THUMOS-14 using the diagnostic tool of Alwassel et al. (2018). mAP$_N$ denotes normalized mAP at tIoU=0.5 with $N$ average ground-truth segments per class. Top: LoSA w/o Long-Short-range Adapter. Bottom: LoSA (Ours). Performance for both XS and XL improves significantly with our method LoSA (bottom) compared to the baseline, LoSA w/o Long-Short-range Adapter (top).
  • Figure S1: Visualizations of LoSA vs. baseline (Head-only Transfer Learning) for THUMOS-14 on VideoMAEv2 (ViT-g). Across all the visualizations (a-d), LoSA is able to localize action snippets (in green) with action boundaries significantly closer to the ground truth than the baseline, leading to fewer false positives and false negatives. LoSA also predicts the action class for the snippets more accurately than the baseline (seen by incorrect class predictions in red by the baseline in (a) and (c)).
  • Figure S2: Visualizations of LoSA vs. baseline (Head-only Transfer Learning) for ActivityNet-v1.3 on VideoMAEv2 (ViT-g). Across all the visualizations (a-d), LoSA is able to localize action snippets (in green) with action boundaries significantly closer to the ground truth than the baseline, leading to fewer false positives and false negatives.
  • ...and 1 more figure
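
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' implementation: attention is single-head with no learned query/key/value projections, spatial tokens are mean-pooled, each gate is one scalar per layer, and all shapes and names (`long_range_adapter`, `short_range_adapter`, `gated_fusion`, `w_proj`) are hypothetical. The backbone features are plain arrays here, standing in for activations that would be treated as constants during training so that no gradient flows through the backbone.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, key_value):
    """Single-head scaled dot-product cross-attention.

    query: (Tq, D), key_value: (Tk, D) -> (Tq, D).
    Learned projections are omitted for brevity (assumption).
    """
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key_value.T / scale, axis=-1)
    return weights @ key_value


def long_range_adapter(inter, last):
    """inter: (C, F, S, D) clips x frames x spatial tokens x channels.
    last: (C*F, D) temporally concatenated last-layer features."""
    C, F, S, D = inter.shape
    # Assumption: pool spatial tokens, then concatenate all clips in time
    # so the query spans the whole untrimmed video.
    query = inter.mean(axis=2).reshape(C * F, D)
    return cross_attention(query, last)


def short_range_adapter(inter, last):
    """Each clip's spatio-temporal tokens query the last-layer features."""
    C, F, S, D = inter.shape
    outs = []
    for clip in inter:
        attended = cross_attention(clip.reshape(F * S, D), last)
        # Assumption: pool spatial tokens after attention to get (F, D).
        outs.append(attended.reshape(F, S, D).mean(axis=1))
    return np.concatenate(outs, axis=0)


def gated_fusion(last, long_outs, short_outs, g_long, g_short, w_proj):
    """Add gated adapter outputs to the last-layer features, then project."""
    fused = last.copy()
    for gl, gs, lo, so in zip(g_long, g_short, long_outs, short_outs):
        fused = fused + gl * lo + gs * so
    return fused @ w_proj


rng = np.random.default_rng(0)
C, F, S, D, n_layers = 2, 4, 3, 4, 3   # toy sizes, chosen for illustration
T = C * F
last = rng.normal(size=(T, D))
inters = [rng.normal(size=(C, F, S, D)) for _ in range(n_layers)]
w_proj = rng.normal(size=(D, D))

long_outs = [long_range_adapter(x, last) for x in inters]
short_outs = [short_range_adapter(x, last) for x in inters]

zero_gates = np.zeros(n_layers)
half_gates = np.full(n_layers, 0.5)
baseline = gated_fusion(last, long_outs, short_outs, zero_gates, zero_gates, w_proj)
enhanced = gated_fusion(last, long_outs, short_outs, half_gates, half_gates, w_proj)
print(enhanced.shape)  # (8, 4)
```

With all gates at zero the fused output reduces to the projected last-layer features, so in this sketch the adapters only change what the TAL head sees once their gates move away from zero.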