Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang, Huan Gao, Ping Guo, Limin Wang

TL;DR

The paper tackles the challenge of adapting pre-trained short-term ViT models for temporal action detection in untrimmed videos. It presents ViT-TAD, an end-to-end framework that adds cross-snippet propagation inside the backbone (via Local and Global Propagation Blocks) and post-backbone temporal Transformer layers to model long-range temporal structure while keeping computation low. With a simple TAD head and VideoMAE pretraining, ViT-TAD achieves strong results across THUMOS14 (69.5 average mAP), ActivityNet-1.3 (37.40 average mAP), and FineAction (17.20 average mAP), outperforming several prior end-to-end and feature-extracted baselines. The approach provides a practical, scalable baseline that leverages powerful pre-trained ViT representations for unified long-form video modeling in TAD.

Abstract

Vision Transformer (ViT) has shown high potential in video recognition, owing to its flexible design, adaptable self-attention mechanisms, and the efficacy of masked pre-training. Yet, it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. Existing works treat them as off-the-shelf feature extractors for each short trimmed snippet, without capturing the fine-grained relations among different snippets in a broader temporal context. To mitigate this issue, this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer to fully unleash their modeling power in capturing inter-snippet relations, while still keeping low computation overhead and memory consumption for efficient TAD. To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets at two levels. For inner-backbone information propagation, we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation, we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields very competitive performance compared with previous temporal action detectors, reaching up to 69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3 and 17.20 average mAP on FineAction.
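
To make the "low computation overhead" motivation concrete, the following back-of-the-envelope sketch counts query-key pairs per attention layer under three layouts. It is only an illustration, not a calculation from the paper: the function name, the snippet count, and the token count are hypothetical, and the 1D case assumes that cross-snippet exchange attends only across snippets at matching token positions.

```python
def attention_pairs(num_snippets: int, tokens_per_snippet: int) -> dict:
    """Count query-key pairs for one attention layer under three layouts."""
    s, n = num_snippets, tokens_per_snippet
    return {
        # Snippets are modeled independently: attention stays inside each snippet.
        "per_snippet": s * n * n,
        # Every token of the clip attends to every other token of the clip.
        "joint_clip": (s * n) ** 2,
        # Assumed 1D exchange: each token position attends across the snippets only.
        "cross_snippet_1d": n * s * s,
    }


# Hypothetical sizes chosen for illustration only.
counts = attention_pairs(num_snippets=16, tokens_per_snippet=1568)
for layout, pairs in counts.items():
    print(f"{layout:>18}: {pairs:,}")
```

Under these assumptions, joint attention over the whole clip costs exactly `num_snippets` times more pairs than independent per-snippet attention, while a temporal-only exchange adds comparatively few pairs on top of the per-snippet cost.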

Paper Structure (26 sections, 1 equation, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Different input processing between baseline and our ViT-TAD. The dashed box illustrates the feature modeling within the backbone. In contrast to the baseline approach which models each snippet individually, our ViT-TAD allows snippets to collaboratively interact with each other during the modeling process within the backbone.
  • Figure 2: Overview of ViT-TAD. We suppose the ViT-based backbone has $n$ blocks and divide them into several subsets. Each subset has $i$ blocks. We divide a video clip into several snippets and send each into the backbone for feature extraction. We perform temporal feature interaction among all snippets through the inner-backbone information propagation strategy. We further conduct clip-level modeling to refine clip-level features through the post-backbone information propagation strategy. $*$ means the last layer is initialized as zero.
  • Figure 3: Temporal positional encoding for all snippets. Original input consists of a snippet and its snippet-level PE. We add a learnable additional temporal PE, called clip-level PE here. PE is short for positional encoding.
  • Figure 4: Comparison between the 1D and 3D propagation strategies. (a) The 3D setting: the ($i+1$)th block takes all snippets as input and directly applies spatiotemporal self-attention to the whole video clip. (b) The 1D setting: a global propagation block is inserted between consecutive backbone blocks.
  • Figure 5: Error analysis of our ViT-TAD. Error rates of five error types are reported on the top-10G predictions, where G denotes the number of ground truths.
  • ...and 1 more figures
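
The captions of Figures 2, 3, and 4(b) can be read together as a simple recipe: add a clip-level positional encoding to each snippet, then let snippets exchange information through a block whose last layer starts at zero. The NumPy sketch below is one possible reading under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the block uses single-head attention across snippets at each token position, and the weights that would be learned in practice are random stand-ins.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def add_clip_level_pe(x, clip_pe):
    """Add one temporal PE vector per snippet (assumed reading of Figure 3).

    x: (snippets, tokens, dim), clip_pe: (snippets, dim).
    """
    return x + clip_pe[:, None, :]


class GlobalPropagationSketch:
    """Hypothetical 1D cross-snippet block: temporal attention plus residual."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Zero-initialized last layer: the block is an identity map at the start,
        # so inserting it does not disturb the pre-trained backbone features.
        self.w_out = np.zeros((dim, dim))

    def __call__(self, x):
        # x: (snippets, tokens, dim) -> attend across snippets per token position.
        h = x.transpose(1, 0, 2)  # (tokens, snippets, dim)
        q, k, v = h @ self.w_q, h @ self.w_k, h @ self.w_v
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
        out = (attn @ v) @ self.w_out  # (tokens, snippets, dim)
        return x + out.transpose(1, 0, 2)  # residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))  # 4 snippets, 6 tokens each, 8 channels
clip_pe = np.zeros((4, 8))  # stand-in for a learnable parameter
x = add_clip_level_pe(x, clip_pe)
block = GlobalPropagationSketch(dim=8, rng=rng)
y = block(x)
print(np.allclose(y, x))
```

Because the output projection is all zeros, the block returns its input unchanged before any training, which is the practical appeal of zero initialization when adding new layers to a pre-trained backbone. Once that projection becomes non-zero, each snippet's features are mixed with those of the other snippets.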