Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

TL;DR

MOV tackles open-vocabulary video classification by feeding multiple modalities (video, optical flow, and audio) to pre-trained vision-language models (VLMs). It fuses the modalities with a cross-modal cross-attention mechanism, keeping the pre-trained vision encoder fixed for video while fine-tuning the flow and audio backbones, which helps it generalize to novel classes. On Kinetics-700 and VGGSound, MOV improves accuracy on base classes and generalizes better to novel classes, with gains that grow as the backbone is scaled up. It also achieves state-of-the-art zero-shot results on UCF101 and HMDB51. The work highlights the potential of combining motion and audio signals with large VLMs for scalable open-vocabulary video understanding.

Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrograms. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
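
To make the open-vocabulary paradigm concrete, the sketch below scores a video embedding against text embeddings of class names using cosine similarity and picks the best match. It is a minimal illustration, not the MOV implementation: the function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions. In practice the embeddings would come from the vision and text encoders of the pre-trained VLM.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_open_vocab(video_emb, text_embs, temperature=0.01):
    """Score one video embedding against named text embeddings.

    video_emb: list of floats (hypothetical output of a vision encoder).
    text_embs: dict mapping class name -> list of floats (hypothetical
               output of a text encoder applied to the class name).
    temperature: assumed softmax temperature, not a value from the paper.
    Returns (best_class_name, dict of class probabilities).
    """
    v = l2_normalize(video_emb)
    logits = {}
    for name, emb in text_embs.items():
        t = l2_normalize(emb)
        logits[name] = sum(a * b for a, b in zip(v, t)) / temperature

    # Numerically stable softmax over the class logits.
    max_logit = max(logits.values())
    exps = {name: math.exp(z - max_logit) for name, z in logits.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings with made-up values, for illustration only.
text_embs = {
    "playing guitar": [0.9, 0.1, 0.0],
    "surfing": [0.0, 0.2, 0.9],
    "baking bread": [0.1, 0.9, 0.1],
}
video_emb = [0.8, 0.2, 0.1]

best, probs = classify_open_vocab(video_emb, text_embs)
print(best)
```

Because classes are represented only by their text embeddings, a novel class can be added at inference time by inserting one more entry into `text_embs`, with no change to the classifier itself.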

Paper Structure (35 sections, 9 equations, 3 figures, 7 tables)

This paper contains 35 sections, 9 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Fine-tuning pre-trained CLIP with video, flow and audio modalities. For all three modalities, fine-tuning on labeled base classes leads to a significant accuracy improvement. However, when the same model is evaluated on novel classes, the performance of the video modality decreases, while the performance of both the flow and audio modalities improves.
  • Figure 2: An overview of the proposed multimodal open-vocabulary (MOV) method. We use the same encoder architecture from the pre-trained vision and language model to encode the video frames, optical flow and audio spectrogram. We then apply a transformer head for temporal fusion. We design a cross-attention mechanism for fusion across modalities. During training, we optimize different modalities simultaneously via calculating their similarity with the text embeddings. During inference, we use separate paths for base and novel class prediction.
  • Figure 3: Per-class improvement analysis. We show the top 20 classes with the most improvement (%) and the top 20 classes with the most degradation (%) when comparing the proposed MOV with CLIP.
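
The cross-modal fusion described in the Figure 2 caption can be illustrated with a small cross-attention sketch. This is a simplified reading under stated assumptions, not the paper's fusion head: it uses a single attention head, omits learned query/key/value projections, lets video tokens act as queries over audio tokens, and adds a residual connection. The token values and the helper names (`cross_attention`, `mean_pool`) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections (assumption).

    queries: token vectors from one modality (e.g. video).
    keys_values: token vectors from another modality (e.g. audio or flow),
                 used as both keys and values.
    Returns one fused vector per query token.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Attention output: weighted average of the value vectors.
        attended = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual connection keeps the original query information.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def mean_pool(tokens):
    """Average token vectors into a single feature vector."""
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


# Toy token sequences with made-up values; lengths may differ per modality.
video_tokens = [[1.0, 0.0], [0.5, 0.5]]
audio_tokens = [[0.0, 1.0], [1.0, 1.0], [0.2, 0.1]]

fused = cross_attention(video_tokens, audio_tokens)
video_feature = mean_pool(fused)
print(video_feature)
```

The pooled feature could then be compared with text embeddings by similarity, as the caption describes for training. Swapping the roles of the two token lists gives the audio-side (or flow-side) fused feature under the same assumptions.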