Open-Vocabulary Temporal Action Localization using Multimodal Guidance

Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor

TL;DR

OVFormer addresses open-vocabulary temporal action localization by combining rich language descriptions with multimodal video features. It generates class-specific descriptions from a large language model, aligns them with frame- and snippet-level video features via a modality mixer, and trains in two stages to generalize to novel categories. The approach extends ActionFormer to open-vocabulary settings and achieves state-of-the-art results on THUMOS14 and ActivityNet-1.3 for both base and novel classes. The results show strong generalization to unseen actions while preserving performance on base actions, a step toward practical open-world video understanding.

Abstract

Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard temporal action localization, where training and test categories are predetermined, OVTAL requires understanding contextual cues that reveal the semantics of novel categories. To address these challenges, we introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions. First, we employ task-specific prompts as input to a large language model to obtain rich class-specific descriptions for action categories. Second, we introduce a cross-attention mechanism to learn the alignment between class representations and frame-level video features, producing multimodal guided features. Third, we propose a two-stage training strategy that consists of training on a larger-vocabulary dataset and then finetuning on downstream data to generalize to novel categories. OVFormer extends existing TAL methods to open-vocabulary settings. Comprehensive evaluations on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our method. Code and pretrained models will be publicly released.
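
The cross-attention step named in the second contribution can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, treats the frame-level features as queries and the class text embeddings as keys and values (the abstract does not state which side plays which role), and uses hypothetical toy vectors. It relies only on the Python standard library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: T vectors of size D (here: frame-level features, an assumption)
    keys, values: C vectors of size D (here: class text embeddings)
    Returns T vectors of size D, one text-guided feature per frame.
    """
    d = len(keys[0])
    guided = []
    for q in queries:
        # Similarity of this frame to every class description.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the class embeddings.
        guided.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return guided


# Hypothetical toy inputs: T=3 frames, C=2 class descriptions, D=2.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
guided_feats = cross_attention(frame_feats, text_embs, text_embs)
```

Each output row is a convex combination of the class embeddings, weighted by how strongly the frame matches each description. A full model would add learned query, key, and value projections and multiple heads.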

Paper Structure (32 sections, 4 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Overview of OVFormer. Given a long untrimmed video $X$, frame- and snippet-level features are extracted and projected into $D$-dimensional feature spaces $Z_F$ and $Z_V$ using the projection functions $P_F$ and $P_V$, respectively. These features are then passed as input to the multi-scale $\phi_{ENC}$ module, which includes our proposed modality mixer. The modality mixer takes $Z_F$ and $Z_V$ as input, where $Z_V$ undergoes self-attention, and $Z_F$ is cross-attended with text embeddings $Z_L$ obtained from LLM-generated descriptions. The resulting multimodal guided features are fused with the self-attended $Z_V$. The output of $\phi_{ENC}$, enriched multimodal snippet-level features $Z$, is used as input for $\phi_{DEC}$, which consists of OV-classification and regression heads. The OV-classification head maps the enriched multimodal snippet-level features to the semantic space, relating them to class semantics and obtaining action candidates. During inference, text embeddings of novel categories are used to enable the OV capability.
  • Figure 2: Design choices for the modality mixer, which serve as baselines for the OVTAL setting and are evaluated in \ref{table:ovtal_sota}. In (a-d), the text embeddings $Z_L$ are introduced in the OV-classification head. (a) Naïve solution that uses only snippet-level features. (b) Introduces text embeddings and cross-attends them with the snippet-level features. (c) A variation on (b) where frame-level features are cross-attended with snippet-level features. (d) Our proposed method cross-attends text embeddings with frame-level features to learn multimodal guided features, which are fused with snippet-level features.
  • Figure 3: Finetuning strategies that freeze or finetune $\phi_{ENC}$/$\phi_{DEC}$ in the OVTAL setting. To isolate the effect of Stage II, Stage I of the training pipeline is always applied.
  • Figure 4: OVFormer performance on THUMOS14 in the OVTAL setting. We compare the performance of P-ActionFormer (\ref{Fig:variations}(a)) and OVFormer (\ref{Fig:variations}(d)) on (a) the billiards action, and (b) the tennis swing and golf swing actions.
  • Figure A5: Class-wise average $mAP$ on THUMOS14 for the 75-25 train-test split.
  • ...and 6 more figures
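
The Figure 1 caption says the OV-classification head relates snippet-level features to class semantics, and that text embeddings of novel categories are supplied at inference. A minimal sketch of that idea follows, assuming cosine similarity as the matching score (the caption does not specify the scoring function). The function name `ov_classify`, the three-dimensional embeddings, and the choice of which classes are base or novel are hypothetical and for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def ov_classify(snippet_feats, class_embs, class_names):
    """Label each snippet with the class whose text embedding is most similar.

    Cosine similarity is an assumed scoring function, not one confirmed by
    the source. The class list is an argument, so new categories can be
    added at inference without retraining anything.
    """
    classes = [l2_normalize(c) for c in class_embs]
    labels = []
    for feat in snippet_feats:
        f = l2_normalize(feat)
        sims = [sum(a * b for a, b in zip(f, c)) for c in classes]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(class_names[best])
    return labels


# Hypothetical toy embeddings; real ones would come from a text encoder.
base_names = ["golf swing", "tennis swing"]
base_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
novel_names = ["billiards"]
novel_embs = [[0.0, 0.0, 1.0]]

snippets = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]

# Base vocabulary only: the second snippet is forced onto a base class.
base_only = ov_classify(snippets, base_embs, base_names)
# Extended vocabulary: adding the novel embedding changes the prediction.
open_vocab = ov_classify(snippets, base_embs + novel_embs, base_names + novel_names)
```

Because the class set enters only through its embeddings, extending the vocabulary amounts to appending rows to the embedding list, which is the property that makes the head open-vocabulary.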