Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng

TL;DR

The paper tackles cross-domain open-vocabulary action recognition by evaluating CLIP-based video learners on the novel XOV-Action benchmark, revealing significant generalization gaps to unseen domains. It introduces Scene-Aware video-Text Alignment (SATA), which combines Scene-Aware and Action-Aware Discrimination losses to learn scene-agnostic video representations, leveraging GPT-4 generated scene suffix prompts to decouple action from scene context. Empirical results show SATA improves closed-set accuracy across diverse target domains while largely preserving open-set performance, supported by ablations and qualitative analyses. The work provides a practical benchmark and a scalable bias-mitigation approach for real-world, domain-diverse video understanding.

Abstract

Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have adapted CLIP to video data, yielding efficient and effective video learners for open-vocabulary action recognition. Motivated by the fact that humans perform actions in diverse environments, our work examines the question: can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
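
The key idea stated in the abstract, keeping video representations apart from scene-encoded text representations while still aligning them with action text, can be illustrated with a small sketch. This is not the paper's formulation: the penalty form, the function names, the temperature, and the default weight are all assumptions made for illustration. Only the idea of a weighting coefficient (written $\lambda_{\text{scene}}$ in the paper) comes from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(video, action_texts, target, temperature=0.07):
    """CLIP-style softmax cross-entropy over video-text cosine similarities.

    `target` is the index of the ground-truth action text.
    The temperature value is an assumption, not taken from the paper.
    """
    logits = [cosine(video, t) / temperature for t in action_texts]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def scene_penalty(video, scene_texts):
    """Hypothetical penalty: mean positive similarity to scene-encoded texts.

    It is zero when the video embedding has no positive similarity
    to any scene-encoded text embedding.
    """
    sims = [max(0.0, cosine(video, s)) for s in scene_texts]
    return sum(sims) / len(sims)


def total_loss(video, action_texts, target, scene_texts, lambda_scene=0.5):
    """Alignment loss plus a weighted scene penalty (weight is illustrative)."""
    return (alignment_loss(video, action_texts, target)
            + lambda_scene * scene_penalty(video, scene_texts))
```

Under this sketch, a video embedding that matches its action text but also points toward a scene-encoded text receives a higher loss than one that is orthogonal to the scene text, which is the direction of pressure the abstract describes.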

Paper Structure (11 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) Can CLIP-based video learners effectively generalize to unseen test domains? For example, given a model trained on normal videos, can it effectively recognize actions in dark videos? (b) We conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners on four test datasets, namely UCF [DBLP:journals/corr/abs-1212-0402], HMDB [DBLP:conf/iccv/KuehneJGPS11], ARID [xu2020arid], and NEC-Dr [DBLP:conf/wacv/ChoiSCH20]. For each test dataset, we report the accuracy on closed-set and open-set action categories, split according to the training categories in Kinetics400 [DBLP:conf/cvpr/CarreiraZ17]. As shown above, these CLIP-based video learners exhibit limited performance when recognizing actions in unseen test domains. Note that, for each metric, we report the best performance among all methods; please refer to Table \ref{tab:k400} for the full results.
  • Figure 2: An overview of our proposed Scene-Aware video-Text Alignment (SATA) method. Our method comprises a video encoder and a text encoder for representation extraction, with a contrastive loss for video-text alignment. Based on the scene-encoded text prompts, we propose the Scene-Aware Discrimination and Action-Aware Discrimination losses, aiming to learn scene-agnostic video representations for cross-domain open-vocabulary action recognition. Best viewed in color.
  • Figure 3: Quantitative analysis of the total number of scene suffixes.
  • Figure 4: Quantitative analysis of the coefficient $\lambda_{\text{scene}}$ for the Scene-Aware Discrimination loss by closed-set accuracy on four test domains. The horizontal axis shows the value of $\lambda_{\text{scene}}$.
  • Figure 5: Qualitative analysis by t-SNE [van2008visualizing].
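
Figure 1 reports accuracy separately for closed-set and open-set action categories, split according to the training categories. A minimal sketch of such a split is given below, assuming that a test sample counts as closed-set when its ground-truth category appears among the training categories and as open-set otherwise. The function name and the category names in the usage example are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def split_accuracy(predictions, labels, train_categories):
    """Return (closed_set_accuracy, open_set_accuracy).

    A sample is closed-set if its ground-truth label is in
    `train_categories`, and open-set otherwise. An empty subset
    yields NaN rather than raising an error.
    """
    closed_hits, open_hits = [], []
    for pred, label in zip(predictions, labels):
        hit = pred == label
        if label in train_categories:
            closed_hits.append(hit)
        else:
            open_hits.append(hit)

    def accuracy(hits):
        return sum(hits) / len(hits) if hits else float("nan")

    return accuracy(closed_hits), accuracy(open_hits)


# Usage with made-up category names and predictions.
labels = ["archery", "archery", "smoking", "smoking"]
predictions = ["archery", "smoking", "smoking", "smoking"]
closed_acc, open_acc = split_accuracy(predictions, labels, {"archery"})
```

Reporting the two numbers separately makes it visible when a method gains on categories seen during training at the expense of unseen ones, or the reverse.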