Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim

TL;DR

This work tackles the limited vocabulary of temporal action localization by proposing STOV-TAL, a two-stage self-training framework that leverages unlabeled web videos to train a class-agnostic localizer and a vision-language model–based open-vocabulary classifier. By introducing generalized zero-shot OV-TAL benchmarks with base/novel splits and cross-dataset evaluation, the authors provide a rigorous assessment of cross-category and cross-domain generalization. Empirical results show that open-domain self-training with web videos significantly improves novel-action generalization ($mAP^{50}_{N}$) and cross-domain performance, with ViFi-CLIP and ViCLIP outperforming CLIP baselines, while large multimodal models like Gemini 1.5 offer strong baselines in some settings. The work highlights both the practical gains and remaining challenges in OV-TAL, including the need for robust evaluation schemes and the varying effectiveness of LMMs across action durations.

Abstract

The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.
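
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released STOV-TAL code: the names `train_fn`, `generate_pseudo_labels`, and `self_train`, the `(start, end, actionness)` segment format, and the `min_score` confidence filter are all assumptions introduced here. The abstract does not say how pseudo-labels are filtered or whether the human-labeled data is mixed back in during the second stage.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical segment format: (start_sec, end_sec, actionness score).
Segment = Tuple[float, float, float]
Localizer = Callable[[str], List[Segment]]
Dataset = Dict[str, List[Segment]]


def generate_pseudo_labels(
    localize: Localizer, unlabeled_videos: List[str], min_score: float = 0.5
) -> Dataset:
    """Run a class-agnostic localizer on unlabeled videos and keep confident segments.

    The score threshold is an assumed filtering rule, not one stated in the abstract.
    """
    pseudo: Dataset = {}
    for video_id in unlabeled_videos:
        kept = [seg for seg in localize(video_id) if seg[2] >= min_score]
        if kept:  # skip videos with no confident segment
            pseudo[video_id] = kept
    return pseudo


def self_train(
    train_fn: Callable[[Dataset], Localizer],
    labeled: Dataset,
    unlabeled_videos: List[str],
    min_score: float = 0.5,
) -> Tuple[Localizer, Dataset]:
    # Stage 1: train on the human-labeled TAL dataset, then pseudo-label web videos.
    teacher = train_fn(labeled)
    pseudo = generate_pseudo_labels(teacher, unlabeled_videos, min_score)
    # Stage 2: train the localizer on the pseudo-labeled dataset.
    student = train_fn(pseudo)
    return student, pseudo
```

Here `train_fn` stands in for whatever training routine produces a localizer from a dataset; any callable that maps a dataset to a function from video id to scored segments will fit this sketch.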

Paper Structure

This paper contains 26 sections, 1 equation, 7 figures, 9 tables.

Figures (7)

  • Figure 1: (a) Our two-stage self-training pipeline. (b) Two types of data sources are explored for self-training: in-domain (ID), including videos from novel categories of the target benchmark, and open-domain (OD), including random web videos. The scalability of our self-training approach is demonstrated by increasing mAPs of both base and novel actions, showing improved generalizability.
  • Figure 2: (Top) Training an action localizer on base categories reduces recall on novel categories. Our self-training approach mitigates this issue. (Bottom) Partially tuning CLIP on base actions improves accuracy on novel actions, but this improvement diminishes before base accuracy saturates. In contrast, ViFi-CLIP, fully tuned with large video-text data, achieves better overall accuracy and does not benefit from partial tuning on a small-scale TAL dataset.
  • Figure 3: Comparison of ZS-TAL and OV-TAL training. In ZS-TAL, the training dataset is confined to $\mathcal{C}_{train}$, which is strictly separated from $\mathcal{C}_{test}$. OV-TAL allows the use of videos without TAL labels, even if these videos may contain actions of $\mathcal{C}_{test}$.
  • Figure 4: Architecture. Features $\mathbf{F}_{\mathcal{V}}$ and $\mathbf{F}_{\mathcal{T}}$ are generated by VLM video and text encoders from video frames and action names, respectively. The action localizer detects class-agnostic action instances, and their features ($\mathbf{F}_{\mathcal{A}}$) are extracted using RoI-Align. Cosine similarities between $\mathbf{F}_{\mathcal{A}}$ and $\mathbf{F}_{\mathcal{T}}$ are computed to assign the top-scoring category to each action instance. The category score ($s_c$) is averaged with its actionness score ($s_a$) to obtain the final confidence score ($s$).
  • Figure 5: Sensitivity analysis of pseudo-labels in self-training.
  • ...and 2 more figures
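
The classification and scoring step in the Figure 4 caption can be illustrated with a small sketch. It assumes the instance feature $\mathbf{F}_{\mathcal{A}}$ and the text features $\mathbf{F}_{\mathcal{T}}$ are already available as plain vectors, and it uses the raw cosine similarity of the top category as $s_c$. The caption does not say whether similarities are rescaled or passed through a softmax first, so that choice, along with the function names `cosine` and `classify_instance`, is an assumption made here.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify_instance(
    f_a: List[float], text_feats: Dict[str, List[float]], s_a: float
) -> Tuple[str, float, float]:
    """Assign the top-scoring category to one action instance.

    f_a:        feature of the detected instance (F_A in the caption)
    text_feats: action name -> text feature (F_T in the caption)
    s_a:        actionness score from the class-agnostic localizer
    Returns (category, s_c, s), where s is the mean of s_c and s_a.
    """
    sims = {name: cosine(f_a, f_t) for name, f_t in text_feats.items()}
    category = max(sims, key=sims.get)  # top-scoring category
    s_c = sims[category]                # assumed: raw cosine as category score
    s = 0.5 * (s_c + s_a)               # final confidence: average of s_c and s_a
    return category, s_c, s
```

Because the category set enters only through `text_feats`, new action names can be added at test time without retraining the localizer, which is what makes the classifier open-vocabulary.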