Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization
Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi
TL;DR
This work tackles weakly supervised temporal action localization (WTAL) by addressing the limitations of deterministic vision-language representations in fine-grained action understanding. It proposes PVLR, a probabilistic embedding framework that aligns human action knowledge with vision-language pre-training (VLP) knowledge, using Monte-Carlo sampling to capture temporal dynamics and uncertainty. A distribution-contrastive learning scheme (intra- and inter-distribution) further shapes a distinctive embedding space, and a VLP distillation term transfers CLIP knowledge into the probabilistic space. Empirical results on THUMOS14 and ActivityNet v1.3 show state-of-the-art performance under weak supervision, with strong gains at higher IoU thresholds and robust generalization when integrated into other WTAL heads. The approach opens avenues for incorporating large language models and richer textual attributes to further enhance fine-grained temporal localization.
Abstract
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, existing research still falls short in two respects. First, previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Second, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
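To make the idea of a probabilistic embedding space more concrete, the following is a minimal sketch rather than the released implementation. It assumes each video snippet is represented as a diagonal Gaussian (a mean vector and a log-variance vector), estimates its similarity to a fixed text embedding by Monte-Carlo sampling, and uses the KL divergence as one possible statistical similarity between two snippet distributions. The abstract does not specify the distribution family or the similarity measure, so the Gaussian form, the cosine similarity, the KL divergence, and all function names (`sample_embedding`, `mc_similarity`, `kl_diag_gaussians`) are illustrative assumptions.

```python
import math
import random


def sample_embedding(mean, log_var, rng):
    """Draw one sample z = mean + sigma * eps, with eps ~ N(0, I).

    Assumption: the snippet embedding is a diagonal Gaussian.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mc_similarity(mean, log_var, text_emb, num_samples=100, seed=0):
    """Monte-Carlo estimate of the expected cosine similarity between a
    Gaussian snippet embedding and a deterministic text embedding."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += cosine(sample_embedding(mean, log_var, rng), text_emb)
    return total / num_samples


def kl_diag_gaussians(mean_p, log_var_p, mean_q, log_var_q):
    """KL(p || q) for two diagonal Gaussians.

    Assumption: KL divergence stands in for the 'statistical similarity'
    mentioned in the abstract; smaller values mean more similar distributions.
    """
    kl = 0.0
    for mp, lp, mq, lq in zip(mean_p, log_var_p, mean_q, log_var_q):
        kl += 0.5 * (lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0)
    return kl


if __name__ == "__main__":
    # Hypothetical 2-D snippet distribution and text embedding.
    snippet_mean, snippet_log_var = [1.0, 0.2], [-2.0, -2.0]
    text_embedding = [1.0, 0.0]
    print(mc_similarity(snippet_mean, snippet_log_var, text_embedding))
    print(kl_diag_gaussians(snippet_mean, snippet_log_var, [0.0, 1.0], [-2.0, -2.0]))
```

In a contrastive setup along these lines, a divergence such as the one above could be pushed down for snippet pairs expected to share an action and pushed up for the rest. How pairs are formed within and across videos is not described in the abstract and is left out of the sketch.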
