Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization

Jia-Run Du, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng

TL;DR

The paper proposes the Generalizable Action Proposal generator (GAP), a model that interfaces seamlessly with CLIP and generates action proposals in a holistic way. GAP uses a query-based architecture and is trained with a proposal-level objective, which lets it estimate proposal completeness and eliminates hand-crafted post-processing.

Abstract

To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that generalize to detecting and classifying actions from unseen categories. They typically build a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model. However, these methods generate incomplete action proposals for unseen categories, because they follow a frame-level prediction paradigm and require hand-crafted post-processing to turn frame-level predictions into proposals. To address this problem, we propose a novel model named Generalizable Action Proposal generator (GAP), which can interface seamlessly with CLIP and generate action proposals in a holistic way. GAP is built on a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate the hand-crafted post-processing. Based on this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. In addition, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, which improves proposal completeness in a generalizable manner. Our experiments show that GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, Thumos14 and ActivityNet1.3, improving over previous works by +3.2% and +3.4% average mAP, respectively.
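
To make the frame-level limitation concrete, the following is a minimal sketch of one common kind of hand-crafted post-processing: thresholding per-frame actionness scores and grouping consecutive frames into segments. It is an illustration only, not the procedure of any specific prior method; the function name `frames_to_proposals`, the threshold of 0.5, and the score values are all hypothetical.

```python
from typing import List, Tuple


def frames_to_proposals(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive frames with score >= threshold into [start, end) segments.

    This is a hypothetical, simplified stand-in for frame-level post-processing.
    """
    proposals: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for t, score in enumerate(scores):
        if score >= threshold and start is None:
            start = t  # a new run begins
        elif score < threshold and start is not None:
            proposals.append((start, t))  # the run ended at the previous frame
            start = None
    if start is not None:
        proposals.append((start, len(scores)))  # close a run that reaches the end
    return proposals


# Hypothetical actionness for ONE action covering frames 1..6, where the
# less salient middle frames (3 and 4) receive low scores.
scores = [0.1, 0.8, 0.9, 0.3, 0.4, 0.9, 0.7, 0.1]
print(frames_to_proposals(scores))  # [(1, 3), (5, 7)]
```

In this toy input the single action is split into two fragments, neither of which covers the whole action. A proposal-level objective, as described in the abstract, instead scores each candidate segment as a whole.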

Paper Structure (21 sections, 8 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Left: Zero-shot temporal action localization requires a model trained on seen action categories to generalize to detecting and classifying unseen action categories at inference. Right: Visualization of the action proposals generated by STALE [nag2022zero], EffPrompt [ju2022prompting], and our GAP. "mIoU" denotes the mean Intersection over Union, which evaluates the completeness of the predicted proposals. GAP generates more complete action proposals and attains a higher mIoU than the compared frame-level methods. Best viewed in color.
  • Figure 2: Left: The pipeline of our method, illustrated with a video of $T=8$ and $N_q=5$ predicted action proposals. Right: An illustration of the motivation for Static-Dynamic Rectifying. The red and blue areas in the horizontal bar represent two predicted action proposals. Top: Detection that leverages only dynamic information may produce incomplete proposals, because the model focuses on salient dynamic parts. Bottom: Once static and dynamic information are combined, each proposal is refined by interacting with proposals that exhibit consistent static information, bringing it closer to the ground truth. Best viewed in color.
  • Figure 3: An illustration of our proposed GAP. The video feature $X$ extracted by the visual encoder is fed into the temporal encoder for temporal dynamics modeling, and an Action-aware Discrimination loss $\mathcal{L}_{ad}$ enhances the temporal modeling by distinguishing actions from the background. Next, the temporal decoder generates dynamic-aware action queries. The Static-Dynamic Rectifying module then injects static information into these queries for refinement. Finally, action proposals are generated and supervised by the detection loss $\mathcal{L}_{det}$. Best viewed in color.
  • Figure 4: Visualization of the three action proposals before and after the Static-Dynamic Rectifying module, without retraining. The same color represents the result from the same action proposal. Best viewed in color.
  • Figure 5: Performance with different numbers of action queries. AVG mAP denotes the average mAP over IoU thresholds from 0.1 to 0.7 in increments of 0.1. All experiments use the 75% vs. 25% split of the Thumos14 dataset. Best viewed in color.
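
The captions above rely on temporal Intersection over Union (IoU), both for the mIoU completeness measure and for the IoU thresholds behind AVG mAP. The sketch below shows the standard temporal IoU between two segments and the threshold grid from 0.1 to 0.7. The names `temporal_iou`, `IOU_THRESHOLDS`, and `thresholds_passed` are illustrative assumptions, the segment values are made up, and the paper's exact mIoU and mAP evaluation protocol is not reproduced here.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(a: Segment, b: Segment) -> float:
    """Standard 1-D IoU: overlap length divided by the length of the union."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


# IoU thresholds 0.1, 0.2, ..., 0.7, as stated in the Figure 5 caption.
IOU_THRESHOLDS: List[float] = [round(0.1 * k, 1) for k in range(1, 8)]


def thresholds_passed(pred: Segment, gt: Segment) -> List[float]:
    """Thresholds at which `pred` would count as a correct localization of `gt`."""
    iou = temporal_iou(pred, gt)
    return [thr for thr in IOU_THRESHOLDS if iou >= thr]


# Hypothetical ground truth and two hypothetical predictions.
ground_truth = (2.0, 10.0)
incomplete = (2.0, 5.0)   # covers only the first part of the action
complete = (2.5, 10.0)    # covers almost the whole action

print(temporal_iou(incomplete, ground_truth))       # 0.375
print(thresholds_passed(incomplete, ground_truth))  # [0.1, 0.2, 0.3]
print(temporal_iou(complete, ground_truth))         # 0.9375
```

An incomplete proposal such as the first one counts as correct only at the loosest thresholds, which is why proposal completeness directly affects the mAP averaged over the threshold grid.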