
EZ-CLIP: Efficient Zeroshot Video Action Recognition

Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat

TL;DR

EZ-CLIP addresses zero-shot video action recognition by efficiently adapting image-based visual-language models without costly fine-tuning. It introduces temporal visual prompting (TVP) to capture cross-frame dynamics and a motion loss to emphasize temporal cues, while keeping the core CLIP backbone frozen and training only prompts and adapters. The approach is validated across five datasets, demonstrating strong zero-shot, base-to-novel, and few-shot generalization with a compact parameter footprint (~5.2M learnable parameters) and single-GPU training. This work provides a practical, scalable path for applying VL models to video, preserving generalization while achieving competitive or superior performance with significantly reduced compute.

Abstract

Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, for videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle to model the crucial temporal aspects inherent to the video domain. In this study, we present EZ-CLIP, a simple and efficient adaptation of CLIP that addresses these challenges. EZ-CLIP leverages temporal visual prompting for seamless temporal adaptation, requiring no fundamental alterations to the core CLIP architecture while preserving its remarkable generalization abilities. Moreover, we introduce a novel learning objective that guides the temporal visual prompts to focus on capturing motion, thereby enhancing the model's ability to learn from video data. We conducted extensive experiments on five different benchmark datasets, thoroughly evaluating EZ-CLIP for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. Impressively, with a mere 5.2 million learnable parameters (as opposed to the 71.1 million in the prior best model), EZ-CLIP can be efficiently trained on a single GPU, outperforming existing approaches in several evaluations.

Paper Structure (26 sections, 13 equations, 8 figures, 13 tables)


Figures (8)

  • Figure 1: Effectiveness of EZ-CLIP: (Left) Performance comparison on the UCF-101 dataset for zero-shot evaluation. The proposed method outperforms existing works while requiring fewer tunable parameters and GFLOPS with higher throughput (TP). Bubble size indicates GFLOPS during inference. (Right) Performance comparison on Something-Something-v2 (SSv2) for base-to-novel evaluation. EZ-CLIP achieves significant improvement with fewer tunable parameters over existing works on this challenging dataset, where motion plays a critical role.
  • Figure 2: Overview of the proposed method: Proposed approach (right) compared to adaptation-based methods (left). EZ-CLIP leverages temporal visual prompting and motion loss to efficiently learn temporal aspects, eliminating the need for the frame integration module, a bottleneck in adapting image models for video understanding. Further details on the $l$-th block $B_l$, adapter placement, frame processing, temporal visual prompt learning (Figure 3), and the motion loss $\mathcal{L}_{motion}$ are provided in the corresponding sections of the paper.
  • Figure 3: Temporal visual prompting: (i) We initialize the temporal prompts $p_{l} \in \mathbb{R}^{T \times D}$ at the $l$-th layer of the transformer for each frame. (ii) Patch embeddings are associated with the temporal prompts using the paper's temporal visual prompting equation. (iii) The learnable temporal prompts are processed through multi-head attention (MHA). (iv) The processed prompts are concatenated with the $l$-th layer embeddings as $[z_{l-1}^{(t)}, p_{l}^{(t)}]$.
  • Figure 4: Visualizing effectiveness of EZ-CLIP: Comparison of attention maps between EZ-CLIP and a base model without temporal prompts and motion loss on examples from the SSv2 (left) and UCF-101 (right) validation sets. First row: input video frames; second row: results with the base model; third row: results with EZ-CLIP. The left example shows the action class 'Putting something next to something'. The base model primarily focuses on object appearance, while EZ-CLIP attends to object interactions and relative motion. The right example shows the action class 'Baseball Pitch', where the base model focuses on object appearance and background features, while EZ-CLIP tracks motion regions.
  • Figure 5: Class-wise analysis: (Left) Performance improvement with EZ-CLIP over the base model (without temporal prompting and motion loss) on the 20 top-performing classes. On class 0 ('pouring something on something'), even the base model performs well because motion is not important, whereas on class 86 ('putting something on surface'), where motion is important, EZ-CLIP improves significantly (the appearance is the same as for 'picking something from surface'). (Middle) t-SNE visualization of features for these classes from the base model, and (right) t-SNE visualization of features from EZ-CLIP.
  • ...and 3 more figures
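
The four steps in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `temporal_prompt_step` is hypothetical, the association in step (ii) is assumed here to be the addition of each frame's mean patch embedding (the paper's actual equation is not reproduced on this page), and a single attention head stands in for MHA to keep the example short.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_prompt_step(z_prev, p_l, w_q, w_k, w_v):
    """One hypothetical temporal-prompting step at layer l.

    z_prev: (T, N, D) embeddings z_{l-1}^{(t)} for T frames, N tokens each.
    p_l:    (T, D) learnable temporal prompts, one row per frame.
    w_q, w_k, w_v: (D, D) attention projections (single head, for brevity).
    Returns the (T, N + 1, D) tokens with prompts appended, and the
    (T, T) attention weights between frame prompts.
    """
    D = z_prev.shape[-1]
    # (ii) ASSUMPTION: tie each prompt to its frame by adding the frame's
    # mean patch embedding; the paper's own equation may differ.
    prompts = p_l + z_prev.mean(axis=1)            # (T, D)
    # (iii) Self-attention over the T prompts lets information flow across frames.
    q, k, v = prompts @ w_q, prompts @ w_k, prompts @ w_v
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)  # (T, T)
    prompts = attn @ v                             # (T, D)
    # (iv) Concatenate [z_{l-1}^{(t)}, p_l^{(t)}] along the token axis.
    z_out = np.concatenate([z_prev, prompts[:, None, :]], axis=1)
    return z_out, attn


rng = np.random.default_rng(0)
T, N, D = 8, 50, 64                                # illustrative sizes only
z_prev = rng.standard_normal((T, N, D))
p_l = 0.02 * rng.standard_normal((T, D))           # (i) prompt initialization
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
z_out, attn = temporal_prompt_step(z_prev, p_l, w_q, w_k, w_v)
print(z_out.shape)  # (8, 51, 64)
```

In this sketch only `p_l` and the attention projections would be trainable, which is in keeping with the frozen-backbone design described in the abstract; the original frame tokens pass through unchanged and each frame gains one extra prompt token.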