Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models
Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui
TL;DR
MOV tackles open-vocabulary video classification by extending pre-trained vision-language models (VLMs) with multimodal inputs: video, optical flow, and audio. It fuses the modalities with a cross-modal cross-attention mechanism, keeping the pre-trained vision encoder frozen while fine-tuning the flow and audio backbones, which helps it generalize to novel classes. On Kinetics-700 and VGGSound, MOV improves accuracy on base classes and generalizes better to novel classes, with gains that grow as the backbone is scaled up. It also achieves state-of-the-art zero-shot results on UCF101 and HMDB51. The work highlights the potential of combining motion and audio signals with large VLMs for scalable, open-vocabulary video understanding.
Abstract
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
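
The general idea of fusing two modalities and then classifying against class-name text embeddings can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single cross-attention step without learned projections, mean pooling, and cosine-similarity scoring, and it uses tiny hand-written vectors in place of real encoder outputs. All function names (cross_attend, mean_pool, classify) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_attend(queries, keys_values):
    """Each query token attends over the tokens of another modality.

    Assumption: no learned query/key/value projections, one head,
    and a residual connection back to the query token.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended = [
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(len(q))
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def mean_pool(tokens):
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def classify(video_tokens, other_tokens, class_text_embeddings):
    """Score a fused clip embedding against text embeddings of class names."""
    fused = l2_normalize(mean_pool(cross_attend(video_tokens, other_tokens)))
    scores = {
        name: dot(fused, l2_normalize(emb))
        for name, emb in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for encoder outputs (illustrative values only).
video_tokens = [[1.0, 0.0], [0.8, 0.2]]
audio_tokens = [[0.9, 0.1]]
class_text_embeddings = {
    "playing guitar": [1.0, 0.0],
    "swimming": [0.0, 1.0],
}

label, scores = classify(video_tokens, audio_tokens, class_text_embeddings)
print(label, scores)
```

Because classes are represented only by their text embeddings, a new class can be added at inference time by inserting another entry into class_text_embeddings, with no change to the fusion code. This is the property that makes the setting open-vocabulary.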
