Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition
Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng
TL;DR
The paper tackles cross-domain open-vocabulary action recognition by evaluating CLIP-based video learners on the new XOV-Action benchmark, revealing significant generalization gaps on unseen domains. It introduces Scene-Aware video-Text Alignment (SATA), which combines Scene-Aware and Action-Aware Discrimination losses to learn scene-agnostic video representations, using GPT-4-generated scene suffix prompts to decouple action from scene context. Empirical results show that SATA improves closed-set accuracy across diverse target domains while largely preserving open-set performance, supported by ablations and qualitative analyses. The work provides a practical benchmark and a scalable bias-mitigation approach for real-world, domain-diverse video understanding.
Abstract
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed adapting the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
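The key idea above, separating video representations from scene-encoded text representations, can be illustrated with a small sketch. This is not the paper's loss: it is one possible reading, assuming a softmax cross-entropy over cosine similarities in which scene-encoded text embeddings are appended as extra negatives. The function name `scene_aware_loss`, its arguments, and the temperature value 0.07 are all hypothetical choices made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scene_aware_loss(video, action_texts, scene_texts, label, temperature=0.07):
    """Hypothetical contrastive loss for a single video embedding.

    video:        video embedding (list of floats)
    action_texts: one text embedding per action class
    scene_texts:  embeddings of scene-encoded prompts, used only as negatives
    label:        index of the ground-truth action in action_texts
    """
    v = normalize(video)
    # Candidates are the action prompts followed by the scene-encoded prompts.
    candidates = [normalize(t) for t in action_texts + scene_texts]
    logits = [dot(v, t) / temperature for t in candidates]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Cross-entropy: small when the video matches its action prompt and is
    # dissimilar from every scene-encoded prompt.
    return log_z - logits[label]


# Toy example: the video embedding coincides with the first action prompt.
actions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
video = [1.0, 0.0, 0.0]
loss_scene_far = scene_aware_loss(video, actions, [[0.0, 0.0, 1.0]], label=0)
loss_scene_near = scene_aware_loss(video, actions, [[1.0, 0.0, 0.0]], label=0)
print(loss_scene_far < loss_scene_near)
```

In this toy setup the loss is larger when a scene-encoded prompt lies close to the video embedding, so minimizing it pushes the video representation away from scene information while keeping it aligned with the action prompt.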
