Frozen CLIP Models are Efficient Video Learners

Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

TL;DR

This work tackles the compute burden of video recognition by freezing a powerful CLIP image backbone and training a lightweight Transformer decoder to fuse multi-layer features for spatiotemporal understanding. EVL introduces a video-level query token and local temporal modules (convolution, positional embedding, and cross-frame attention) to extract temporal cues without updating the backbone. Through extensive experiments on Kinetics-400 and Something-Something-v2, EVL achieves competitive accuracy with substantially reduced training time and memory, and its CLIP-based features provide complementary knowledge to supervised signals, as demonstrated by ensemble gains. The approach broadens access to high-performance video models by leveraging open-vocabulary image representations in a transfer-learning paradigm that scales with larger pretrained CLIP models and multi-layer high-resolution features.

Abstract

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal cues from adjacent frames and their attention maps. We show that, despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
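
The two mechanisms named in the abstract, a query token that pools frame-level features and a local temporal module over adjacent frames, can be illustrated with a small standard-library Python sketch. This is not the authors' implementation (see the linked repository for that). It assumes a single attention head with no learned key/value projections, a temporal convolution whose kernel weights are shared across channels, and random numbers standing in for frozen CLIP features. The function names `temporal_conv` and `query_cross_attention` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporal_conv(frames, kernel):
    """1D convolution along the time axis with zero padding.

    frames: T frames, each a list of N tokens, each a list of D floats.
    kernel: odd-length list of scalar weights (assumption: shared by all
    channels, unlike a learned per-channel kernel).
    """
    num_frames = len(frames)
    half = len(kernel) // 2
    out = []
    for t in range(num_frames):
        frame_out = []
        for n in range(len(frames[t])):
            token = [0.0] * len(frames[t][n])
            for k, w in enumerate(kernel):
                src = t + k - half  # index of the neighbouring frame
                if 0 <= src < num_frames:
                    for d, v in enumerate(frames[src][n]):
                        token[d] += w * v
            frame_out.append(token)
        out.append(frame_out)
    return out


def query_cross_attention(query, frames):
    """One query vector attends to every token of every frame."""
    tokens = [tok for frame in frames for tok in frame]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, tok) * scale for tok in tokens])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, tokens))
        for d in range(len(query))
    ]
    return pooled, weights


# Stand-in for frozen backbone output: T frames x N tokens x D channels.
random.seed(0)
T, N, D = 4, 3, 8
frozen = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
          for _ in range(T)]

# Temporal information is added on top of the raw frame features (residual).
temporal = temporal_conv(frozen, [0.25, 0.5, 0.25])
enhanced = [[[x + y for x, y in zip(tok, ttok)]
             for tok, ttok in zip(frame, tframe)]
            for frame, tframe in zip(frozen, temporal)]

# In training, the query would be a learned parameter; here it is random.
query = [random.gauss(0, 1) for _ in range(D)]
video_feature, attention = query_cross_attention(query, enhanced)
```

In this sketch only `query` and `kernel` would be trainable, while the `frozen` features are computed once per video and never receive gradients, which is the source of the training savings the abstract describes.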

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: Left: illustration of the difference between our EVL training pipeline and other video recognition methods. Right: although EVL targets efficient training, our models set new accuracy vs. inference FLOPS Pareto frontiers. On Kinetics-400, the 8-frame ViT-B/16 model achieves 82.9% top-1 accuracy with only 60 V100 GPU-hours of training.
  • Figure 2: Model architecture overview. (a) Top-level architecture: multiple intermediate feature maps from a massively pretrained image backbone are fed into a Transformer decoder to gather information from them. (b) Motion-enhanced Transformer decoder block: temporal modeling is added on top of raw frame features $X_i$ to retain structural information of the spatiotemporal features.
  • Figure 3: Training time vs. accuracy with frozen or finetuned backbone. The number inside each marker is the number of frames per view. A frozen backbone is more efficient when pretraining quality is higher.
  • Figure 4: Visualization of video-level decoder attention maps. Visualizations of the 2D CLIP [CLS] token and the 3D video-level [CLS] token are provided in the top and bottom rows, respectively. Human-action-specific contents are attended to more (e.g., human body, facial parts, objects in hands, moving objects).
  • Figure 5: Model ensemble and single model accuracy vs. GFLOPS on Kinetics-400.
  • ...and 1 more figures