Is Temporal Prompting All We Need For Limited Labeled Action Recognition?

Shreyank N Gowda, Boyan Gao, Xiao Gu, Xiaobo Jin

TL;DR

This work presents TP-CLIP, a lightweight adaptation of CLIP for video that uses temporal visual prompting to capture motion without modifying the core architecture. By freezing the CLIP backbone and introducing a temporal encoder plus adapters, TP-CLIP achieves strong zero-shot and few-shot action recognition with far fewer tunable parameters and lower GFLOPs than prior methods. Across five datasets, it outperforms state-of-the-art approaches in multiple settings, including base-to-novel and generalized zero-shot benchmarks, while maintaining high efficiency and throughput. The results demonstrate that temporal prompting can preserve CLIP’s generalization while effectively handling video dynamics, offering a practical path toward scalable video understanding on resource-constrained setups.

Abstract

Video understanding has shown remarkable improvements in recent years, largely dependent on the availability of large-scale labeled datasets. Recent advancements in vision-language models, especially those based on contrastive pretraining, have shown remarkable generalization in zero-shot tasks, helping to overcome this dependence on labeled datasets. Adaptations of such models for videos typically involve modifying the architecture of vision-language models to cater to video data. However, this is not trivial, since such adaptations are mostly computationally intensive and struggle with temporal modeling. We present TP-CLIP, an adaptation of CLIP that leverages temporal visual prompting for temporal adaptation without modifying the core CLIP architecture, which preserves its generalization abilities. TP-CLIP integrates efficiently into the CLIP architecture, leveraging its pre-trained capabilities for video data. Extensive experiments across various datasets demonstrate its efficacy in zero-shot and few-shot learning, outperforming existing approaches with fewer parameters and greater computational efficiency. In particular, we use just 1/3 of the GFLOPs and 1/28 of the tunable parameters of a recent state-of-the-art method and still outperform it by up to 15.8%, depending on the task and dataset.

Paper Structure

This paper contains 27 sections, 5 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparing CLIP-based models on UCF101: Our TP-CLIP (highlighted) excels with fewer parameters, surpassing recent SOTA models. Bubble size = model GFLOPs.
  • Figure 2: We add temporal prompt learning to CLIP. The temporal visual prompts (using the Temporal Encoder) are instrumental in identifying inter-frame relationships, which aids in modeling the aspect of motion in videos. Additionally, adapter modules are utilized to modify spatial features, facilitating more effective temporal learning.
  • Figure 3: Adapters ensure minimal parameter updates whilst keeping the generalization ability of the pre-trained model consistent. In comparison to (a) full fine-tuning, using (b) adapter modules significantly reduces tuneable parameter count.
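
To make the contrast in Figure 3 concrete, the following is a minimal parameter-count sketch, not the authors' implementation. It assumes a ViT-B/16-sized visual encoder (token width 768, 12 Transformer blocks), one bottleneck adapter per block, and a hypothetical bottleneck width of 64. None of these sizes are taken from the paper, so the printed numbers only illustrate the order of magnitude involved.

```python
# Illustrative parameter arithmetic only; the sizes below are assumptions,
# not values reported by the paper.

def transformer_block_params(d: int, mlp_ratio: int = 4) -> int:
    """Weights and biases in one standard Transformer encoder block."""
    attention = 4 * (d * d + d)            # Q, K, V and output projections
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)
    layer_norms = 2 * (2 * d)              # two LayerNorms, scale + shift each
    return attention + mlp + layer_norms


def bottleneck_adapter_params(d: int, r: int) -> int:
    """Down-projection d -> r and up-projection r -> d, both with biases."""
    return (d * r + r) + (r * d + d)


WIDTH = 768        # assumed token width (ViT-B/16)
NUM_BLOCKS = 12    # assumed encoder depth (ViT-B/16)
BOTTLENECK = 64    # hypothetical adapter bottleneck width

# (a) full fine-tuning: every block parameter is trainable
full = NUM_BLOCKS * transformer_block_params(WIDTH)
# (b) adapters: blocks stay frozen, only one adapter per block is trained
adapters = NUM_BLOCKS * bottleneck_adapter_params(WIDTH, BOTTLENECK)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"adapters only:    {adapters:,} trainable parameters")
print(f"fraction trained: {adapters / full:.2%}")
```

Under these assumptions the adapters account for under 2% of the parameters that full fine-tuning would update. Because the frozen weights never change, the pre-trained model's behavior is preserved, which is the property the caption describes as keeping generalization consistent.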