Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi

TL;DR

This work tackles weakly supervised temporal action localization (WTAL) by addressing the limitations of deterministic vision-language representations in fine-grained action understanding. It proposes PVLR, a probabilistic embedding framework that aligns human action knowledge with vision-language pre-training (VLP) knowledge and uses Monte-Carlo sampling to capture temporal dynamics and uncertainty. An intra- and inter-distribution contrastive learning scheme further shapes a distinctive embedding space, and a VLP distillation term transfers CLIP knowledge into the probabilistic space. Results on THUMOS14 and ActivityNet v1.3 show state-of-the-art performance under weak supervision, with strong gains at higher IoU thresholds and robust generalization when integrated into other WTAL heads. The approach opens avenues for incorporating large language models and richer textual attributes to further improve fine-grained temporal localization.

Abstract

Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
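
The idea of a probabilistic snippet embedding scored against class-name text embeddings by Monte-Carlo sampling can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation. The diagonal-Gaussian parameterization, the cosine-similarity scoring, and all names (`sample_snippet_embeddings`, `mc_class_scores`, `num_samples`) are assumptions made for the example; the released code at the URL above is the authoritative reference.

```python
import numpy as np


def sample_snippet_embeddings(mu, log_var, num_samples, rng):
    """Draw reparameterized samples from N(mu, diag(exp(log_var))) per snippet.

    mu, log_var: arrays of shape (T, D) for T snippets and D embedding dims.
    Returns an array of shape (num_samples, T, D).
    """
    std = np.exp(0.5 * log_var)                           # (T, D)
    eps = rng.standard_normal((num_samples,) + mu.shape)  # (K, T, D)
    return mu[None] + std[None] * eps


def mc_class_scores(mu, log_var, text_emb, num_samples=8, seed=0):
    """Monte-Carlo estimate of snippet-to-class cosine similarity.

    text_emb: array of shape (C, D), one (assumed) text embedding per class.
    Returns an array of shape (T, C), averaged over the sampled embeddings.
    """
    rng = np.random.default_rng(seed)
    samples = sample_snippet_embeddings(mu, log_var, num_samples, rng)
    # L2-normalize so the dot product below is a cosine similarity.
    samples = samples / np.linalg.norm(samples, axis=-1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sims = samples @ text.T                               # (K, T, C)
    return sims.mean(axis=0)                              # (T, C)


if __name__ == "__main__":
    T, D, C = 5, 16, 3
    rng = np.random.default_rng(1)
    mu = rng.standard_normal((T, D))
    log_var = np.full((T, D), -2.0)  # hypothetical per-dimension log-variance
    text_emb = rng.standard_normal((C, D))
    print(mc_class_scores(mu, log_var, text_emb).shape)
```

Averaging over several samples, rather than scoring only the mean embedding, lets the per-snippet variance influence the resulting class scores, which is the motivation for sampling in this sketch.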

Paper Structure (26 sections, 19 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: (a) CLIP's deterministic pre-training with image-text pairs fails to equip it with the necessary understanding of fine-grained human motion variations. (b) Earlier studies have primarily focused on the direct mapping between language models and visual input based on deterministic representation. (c) The proposed framework utilizes probabilistic embedding and aligns VLP knowledge.
  • Figure 2: Overview of the proposed PVLR. (a) Probabilistic Class Activation Sequence: probabilistic adapters are added to estimate a probability distribution for each snippet. (b) Leveraging VLP knowledge: the model estimates probabilistic distributions and is guided by the semantic textual information of the action categories. (c) Distribution Contrastive Learning: learning statistical similarities between the probabilistic distributions is intended to build a distinctive embedding space.
  • Figure 3: Qualitative results on THUMOS14, comparing the class activation sequences (CAS) of the deterministic and probabilistic approaches. The red box marks background and the blue box marks action.
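
The "statistical similarities" between snippet distributions mentioned in the Figure 2 caption can be illustrated with a closed-form divergence between Gaussians. The measure is not specified in this section, so the sketch below assumes diagonal Gaussians and uses the KL divergence purely as an example; the function names and the log-variance parameterization are hypothetical.

```python
import numpy as np


def kl_diag_gaussians(mu_p, log_var_p, mu_q, log_var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ), summed over the last axis."""
    var_p = np.exp(log_var_p)
    var_q = np.exp(log_var_q)
    return 0.5 * np.sum(
        log_var_q - log_var_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0,
        axis=-1,
    )


def pairwise_symmetric_kl(mu, log_var):
    """Symmetrized KL between every pair of snippet distributions.

    mu, log_var: arrays of shape (T, D). Returns a (T, T) matrix whose entry
    (i, j) is small when snippets i and j have similar distributions.
    """
    forward = kl_diag_gaussians(
        mu[:, None, :], log_var[:, None, :], mu[None, :, :], log_var[None, :, :]
    )
    # KL(q || p) is the transpose of KL(p || q) in this pairwise layout.
    return 0.5 * (forward + forward.T)


if __name__ == "__main__":
    mu = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
    log_var = np.zeros((3, 2))  # unit variance in every dimension
    print(pairwise_symmetric_kl(mu, log_var))
```

In a contrastive setup, such a matrix could serve as the distance term: pairs expected to share a label (action with action) would be pulled toward small divergence and action-background pairs pushed apart. How PVLR actually forms and weights these pairs is not stated in this section.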