Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization
Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim
TL;DR
This work tackles the limited vocabulary of temporal action localization by proposing STOV-TAL, a two-stage self-training framework that leverages unlabeled web videos to train a class-agnostic localizer and a vision-language model–based open-vocabulary classifier. By introducing generalized zero-shot OV-TAL benchmarks with base/novel splits and cross-dataset evaluation, the authors assess both cross-category and cross-domain generalization. Empirical results show that open-domain self-training with web videos significantly improves novel-action generalization ($mAP^{50}_{N}$) and cross-domain performance, with ViFi-CLIP and ViCLIP outperforming CLIP baselines, while large multimodal models such as Gemini 1.5 provide strong baselines in some settings. The work highlights both the practical gains and the remaining challenges in OV-TAL, including the need for robust evaluation schemes and the varying effectiveness of LMMs across action durations.
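As a rough illustration of what a metric like $mAP^{50}_{N}$ rests on, the sketch below computes the temporal IoU (tIoU) between a predicted and a ground-truth segment and checks it against a 0.5 threshold. It assumes that the superscript 50 denotes a tIoU threshold of 0.5 and that the subscript N denotes novel classes, which is the usual TAL convention but is not spelled out in this summary. The function names are hypothetical and are not taken from the released code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """Assumed matching rule: a prediction counts if tIoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold
```

Average precision is then computed per class from the ranked predictions and averaged over the evaluated classes; that part is omitted here.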
Abstract
The vocabulary size in temporal action localization (TAL) is limited by the scarcity of large-scale annotated datasets. To overcome this, recent works integrate vision-language models (VLMs), such as CLIP, for open-vocabulary TAL (OV-TAL). However, despite the success of VLMs trained on extensive datasets, existing OV-TAL methods still rely on human-labeled TAL datasets of limited size to train action localizers, limiting their generalizability. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our approach consists of two stages: (1) a class-agnostic action localizer is trained on a human-labeled TAL dataset to generate pseudo-labels for unlabeled videos, and (2) the large-scale pseudo-labeled dataset is then used to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we identify limitations in existing OV-TAL evaluation schemes and propose a new benchmark for thorough assessment. Finally, we showcase the TAL performance of the large multimodal model Gemini-1.5 on our new benchmark. Code is released at https://github.com/HYUNJS/STOV-TAL.
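The two stages described in the abstract can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `Segment`, `pseudo_label`, `self_train`, the `train_fn` callback, and the `min_score` confidence filter are all hypothetical names introduced for illustration. Whether the second stage also mixes in the human-labeled data is not specified in the abstract; the sketch trains on the pseudo-labels only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    start: float
    end: float
    score: float  # class-agnostic confidence that an action occurs


# A localizer maps a video id to a list of class-agnostic segments.
Localizer = Callable[[str], List[Segment]]
# A training function maps {video id: segments} to a trained localizer.
TrainFn = Callable[[Dict[str, List[Segment]]], Localizer]


def pseudo_label(localizer: Localizer, video_ids: List[str],
                 min_score: float = 0.5) -> Dict[str, List[Segment]]:
    """Run the localizer on unlabeled videos and keep confident segments."""
    labels: Dict[str, List[Segment]] = {}
    for vid in video_ids:
        kept = [s for s in localizer(vid) if s.score >= min_score]
        if kept:  # skip videos with no confident segment
            labels[vid] = kept
    return labels


def self_train(train_fn: TrainFn, labeled: Dict[str, List[Segment]],
               unlabeled_ids: List[str], min_score: float = 0.5) -> Localizer:
    """Stage 1: train on human labels. Stage 2: train on pseudo-labels."""
    teacher = train_fn(labeled)
    pseudo = pseudo_label(teacher, unlabeled_ids, min_score)
    return train_fn(pseudo)
```

In practice `train_fn` would wrap the training loop of an actual localizer, and the threshold would be tuned rather than fixed at 0.5.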
