FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

TL;DR

FROSTER tackles open-vocabulary action recognition by preserving CLIP's broad generalization while enabling video-specific adaptation. It achieves this through a residual feature distillation framework that uses a frozen CLIP as a teacher and a residual sub-network to balance learning between generalizable and video-specific features, avoiding overfitting and architecture constraints. The method is validated across base-to-novel and cross-dataset settings on multiple datasets, consistently attaining state-of-the-art performance and showing strong generalization, especially on semantically distant test sets. The approach is architecture-agnostic, data-efficient, and demonstrates improved temporal understanding without requiring temporal pretraining data, making it practically impactful for open-world video understanding.

Abstract

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretraining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP's pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster.
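
The residual feature distillation described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the residual form `x + alpha * MLP(x)`, the ReLU two-layer MLP, the zero initialization of the second layer, the mean-squared-error distance, the weights `alpha` and `beta`, and every function name below are assumptions chosen for illustration. The sketch only shows how a classification loss and a distillation loss against a frozen teacher feature could be combined.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (plain Python lists)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_transform(x_student, w1, b1, w2, b2, alpha=0.1):
    """Return x_student + alpha * MLP(x_student).

    The residual form and alpha=0.1 are illustrative assumptions.
    """
    hidden = relu(linear(x_student, w1, b1))
    delta = linear(hidden, w2, b2)
    return [xs + alpha * d for xs, d in zip(x_student, delta)]

def feature_distillation_loss(x_student, x_teacher):
    """Mean squared error between a student feature and a frozen teacher feature."""
    return sum((s - t) ** 2 for s, t in zip(x_student, x_teacher)) / len(x_student)

def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def total_loss(logits, label, x_student, x_teacher, mlp, alpha=0.1, beta=1.0):
    """Classification loss plus beta-weighted distillation loss (beta is hypothetical)."""
    w1, b1, w2, b2 = mlp
    x_hat = residual_transform(x_student, w1, b1, w2, b2, alpha)
    return cross_entropy(logits, label) + beta * feature_distillation_loss(x_hat, x_teacher)

def init_mlp(dim, hidden, rng):
    """Small random first layer; second layer set to zero (an assumption),
    so the residual branch starts as an identity mapping."""
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [[0.0] * hidden for _ in range(dim)]
    b2 = [0.0] * dim
    return w1, b1, w2, b2

# Toy usage: the vectors stand in for pooled video features from the
# tuned student and the frozen CLIP teacher.
rng = random.Random(0)
mlp = init_mlp(dim=4, hidden=8, rng=rng)
x_student = [0.5, -1.0, 0.25, 2.0]
x_teacher = [0.4, -0.9, 0.30, 1.8]
loss = total_loss([2.0, 0.5, -1.0], 0, x_student, x_teacher, mlp)
```

With the second layer at zero, the transformed student feature equals the raw student feature, so the distillation term initially reduces to direct feature matching. As the residual branch is trained, it can absorb part of the gap to the teacher, which is one way to read the "balance" between generalizable and video-specific features mentioned above.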

Paper Structure (32 sections, 8 equations, 9 figures, 7 tables)

This paper contains 32 sections, 8 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Performance comparison (Top-1 Acc (%)) under the open-vocabulary evaluation setting where the models are tuned on Kinetics-400, but evaluated on UCF-101, HMDB-51, and Kinetics-600. Note that shared categories between Kinetics-600 and Kinetics-400 are excluded when testing on Kinetics-600. The numbers in the brackets denote the semantic distance between training and evaluation datasets, which is measured by Hausdorff distance on text features of category names. Larger numbers denote higher similarities.
  • Figure 2: Overall idea of our FROSTER framework. It effectively learns feature representation that is both video-specific (via simple action-based fine-tuning) and generalizable (via our proposed residual feature distillation).
  • Figure 3: Illustration of feature distillation approaches. $L_{CE}$ and $L_{FD}$ denote cross-entropy loss and feature distillation loss, respectively. (a) Directly minimizing the feature distance between the student and teacher models [projector_ensemble]. (b) Using a two-layer MLP to map features from the student space to the teacher space [projector1, projector2]. (c) The proposed residual feature distillation, employing a modified residual network on the student model to achieve a balance in feature learning. Our method aims to simultaneously optimize video-specific knowledge and generalizability, enhancing the overall performance of the model.
  • Figure 4: Visualization of attention correlations between [CLS] and image tokens for the action "pour". Our method focuses on moving objects and informative backgrounds.
  • Figure 5: Relative improvements of different methods when combined with FROSTER. The number next to the legend for each dataset indicates the dataset's similarity with K-400 using Hausdorff distance (cosine similarity).
  • ...and 4 more figures
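
The captions of Figures 1 and 5 mention a Hausdorff distance over text features of category names, combined with cosine similarity. The sketch below shows one plausible way such a dataset-level measure could be computed. It is an assumption, not the paper's exact procedure: it uses the standard symmetric Hausdorff distance with cosine distance (1 minus cosine similarity) as the point-wise metric and reports `1 - distance` as a similarity score, in line with the captions' note that larger numbers denote higher similarity. The two-dimensional toy vectors stand in for text-encoder embeddings of category names, and all function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directed_hausdorff(set_a, set_b):
    """Largest cosine distance from any point in set_a to its nearest point in set_b."""
    return max(
        min(1.0 - cosine_similarity(a, b) for b in set_b)
        for a in set_a
    )

def hausdorff_distance(set_a, set_b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(set_a, set_b), directed_hausdorff(set_b, set_a))

def hausdorff_similarity(set_a, set_b):
    """Similarity score derived from the distance (an illustrative convention)."""
    return 1.0 - hausdorff_distance(set_a, set_b)

# Toy stand-ins for text features of category names in two datasets.
train_text_features = [[1.0, 0.0], [0.0, 1.0]]
test_text_features = [[1.0, 0.0]]

distance = hausdorff_distance(train_text_features, test_text_features)
similarity = hausdorff_similarity(train_text_features, test_text_features)
```

In this toy case one training category has no close counterpart in the test set, so the symmetric distance is driven by that worst-matched category. This is the property that makes a Hausdorff-style measure sensitive to label sets that only partially overlap.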