Test-Time Zero-Shot Temporal Action Localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci

TL;DR

The paper tackles zero-shot temporal action localization (ZS-TAL) when no training data is available. It introduces T3AL, which adapts a pre-trained Vision-Language Model at test time, one video at a time, without any supervised training. T3AL operates in three steps: (i) video-level pseudo-labeling to select a candidate action class, (ii) self-supervised refinement with a BYOL-like objective to sharpen frame-level predictions, and (iii) text-guided region suppression using captions from CoCa to prune implausible proposals. On THUMOS14 and ActivityNet-v1.3, T3AL outperforms zero-shot baselines built on state-of-the-art VLMs, and an oracle study indicates that, with perfect pseudo-labels, region counts, and refinement samples, it would surpass training-based ZS-TAL methods. The work also analyzes the cross-dataset generalization of training-based methods, provides ablations and qualitative results, and outlines limitations and future directions.
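
Per-video adaptation can be pictured as a loop in which every video starts from the pre-trained parameters and the adapted parameters are thrown away after the prediction (the overview figure caption states that the optimized projector parameters are re-initialized to those of the pre-trained model). The sketch below is only an illustration under stated assumptions: `adapt_and_predict`, `toy_adapt_step`, `toy_predict`, the scalar parameter `w`, and the number of steps are all hypothetical stand-ins, not the authors' implementation.

```python
import copy

def adapt_and_predict(videos, pretrained_params, adapt_step, predict, num_steps=3):
    """Adapt on each video independently, always starting from the pre-trained parameters."""
    predictions = []
    for video in videos:
        # Fresh copy: nothing learned on a previous video is carried over.
        params = copy.deepcopy(pretrained_params)
        for _ in range(num_steps):
            params = adapt_step(params, video)
        predictions.append(predict(params, video))
        # `params` is discarded here, which plays the role of the re-initialization.
    return predictions

# Hypothetical toy components, used only to make the loop runnable.
def toy_adapt_step(params, video):
    target = sum(video) / len(video)
    return {"w": params["w"] + 0.5 * (target - params["w"])}

def toy_predict(params, video):
    return params["w"]

pretrained = {"w": 0.0}
preds = adapt_and_predict([[1.0, 1.0], [4.0, 4.0]], pretrained, toy_adapt_step, toy_predict)
print(preds)
```

Because each video starts from the same parameters, the prediction for a video does not depend on which videos were processed before it.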

Abstract

Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
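
The first step, computing a video-level pseudo-label by aggregating information from the whole video, can be sketched as follows. This is a minimal illustration, assuming that aggregation means averaging per-frame features and that the comparison with class names uses cosine similarity; the function names and the three-dimensional toy vectors are hypothetical and stand in for the outputs of a VLM's vision and text encoders.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def video_pseudo_label(frame_features, class_text_features):
    """Return the class whose text feature is most similar to the averaged frame feature."""
    video_feature = mean_vector(frame_features)
    scores = {
        name: cosine(video_feature, text_feature)
        for name, text_feature in class_text_features.items()
    }
    return max(scores, key=scores.get), scores

# Toy stand-ins for encoder outputs (hypothetical values).
frames = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0]]
classes = {"high jump": [1.0, 0.0, 0.0], "diving": [0.0, 1.0, 0.0]}
label, scores = video_pseudo_label(frames, classes)
print(label)
```

In this toy input two of the three frames point towards the first class, so the averaged feature is closer to "high jump" even though one frame looks like "diving".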

Paper Structure (15 sections, 11 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Task setup. Previous approaches tackling ZS-TAL (a) train the model on labelled data and test it in-domain. Because such models generalize poorly out of distribution, we propose to update the parameters at test time on a stream of unlabelled videos, without prior supervised training (b).
  • Figure 2: Cross-dataset generalization. We show the average mAP, computed at IoU thresholds of $[0.3{:}0.1{:}0.7]$, for EffPrompt and STALE when trained and tested on THUMOS14, and when trained on a different dataset and tested on THUMOS14. We report results for the 75:25 (75% seen classes) and 50:50 (50% seen classes) evaluation settings.
  • Figure 3: Overview of the proposed method. T3AL addresses the task of ZS-TAL by learning only at test time on unlabelled data. We first compare the averaged visual frame features with the textual class names to identify the video pseudo-label. We then refine the scores between the visual frames and the video pseudo-label with self-supervision. Last, we exploit the decoder of a captioning model (i.e., CoCa [yu2022coca]) to generate captions and perform text-guided region suppression. We only fine-tune the vision and language projectors, while keeping the encoders frozen. Once the prediction is obtained, the optimized parameters $\theta_{\mathcal{P}_V}^\ast$ and $\theta_{\mathcal{P}_L}^\ast$ are re-initialized to the ones of the pre-trained model.
  • Figure 4: Oracle study. We re-evaluate our configuration with partial oracle information: perfect class prediction for the pseudo-label, perfect selection of the number of regions in the video, and perfect selection of positive and negative refinement samples. With all oracle mechanisms enabled, we surpass training-based models.
  • Figure 5: Captions generated from frames of the video video_test_0000793.txt.
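
The evaluation protocol mentioned in the Figure 2 caption averages mAP over temporal IoU thresholds from 0.3 to 0.7 in steps of 0.1. The sketch below shows only the two building blocks that this implies: the temporal IoU between a predicted and a ground-truth segment, and the averaging over thresholds. It does not compute mAP itself, and the function names, the example segments, and the per-threshold values are hypothetical placeholders rather than results from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

def iou_thresholds(start=0.3, stop=0.7, step=0.1):
    """Thresholds start, start + step, ..., stop, rounded to avoid float drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 2) for i in range(count)]

def average_over_thresholds(map_per_threshold):
    """Mean of per-threshold mAP values (the values are supplied by the caller)."""
    return sum(map_per_threshold.values()) / len(map_per_threshold)

thresholds = iou_thresholds()
iou = temporal_iou((2.0, 6.0), (3.0, 7.0))  # 3 s overlap over a 5 s union

# Placeholder per-threshold values, for illustration only.
placeholder_map = {t: 1.0 - t for t in thresholds}
print(thresholds, iou, average_over_thresholds(placeholder_map))
```

A proposal counts as correct at a given threshold only if its temporal IoU with a ground-truth segment of the same class reaches that threshold, so stricter thresholds reward tighter boundaries.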