Open Vocabulary Multi-Label Video Classification

Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi

TL;DR

Open-vocabulary multi-label video classification requires recognizing multiple concepts from arbitrary vocabularies in videos. The authors adapt a pre-trained vision-language model (CLIP) with an LLM-guided label encoder using learnable prefixes and a prompt transformer, and they augment the vision encoder with a regularized temporal modeling branch. The method includes a multi-label training objective with scores $s(l,v)=f_t(l)^T f_v(v)$ and $p(l,v)=\sigma\left(\frac{s(l,v)}{\tau}\right)$, along with a vocabulary expansion strategy via synthetic labels derived from multimodal LLMs. Experiments across five datasets demonstrate strong open-vocabulary performance and robust calibration, with ablations validating the efficacy of learnable prompting, temporal modeling, and regularization.
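The scoring rule and objective above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are taken as precomputed lists of floats, the function names are hypothetical, and the temperature value 0.07 is an arbitrary placeholder. The binary cross-entropy form follows the paper's description of its training objective (Figure 5).

```python
import math

def score(label_emb, video_emb):
    """Similarity s(l, v) = f_t(l)^T f_v(v), computed as a dot product."""
    return sum(a * b for a, b in zip(label_emb, video_emb))

def probability(s, tau=0.07):
    """p(l, v) = sigmoid(s / tau); tau=0.07 is an assumed placeholder value."""
    z = s / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative inputs
    return e / (1.0 + e)

def multilabel_bce(label_embs, video_emb, targets, tau=0.07, eps=1e-7):
    """Mean binary cross-entropy over all labels for a single video."""
    total = 0.0
    for emb, y in zip(label_embs, targets):
        p = probability(score(emb, video_emb), tau)
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)
```

Because each label gets its own sigmoid rather than sharing a softmax, several labels can be predicted as present for the same video, which is what the multi-label setting requires.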

Abstract

Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection, and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single-label action classification in videos. However, previous methods fall short in holistic video understanding, which requires the ability to simultaneously recognize multiple actions and entities (e.g., objects) in the video in an open vocabulary setting. We formulate this problem as open vocabulary multi-label video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels and thereby improve its open vocabulary performance, with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text encoder, enabling it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP's vision encoder to effectively model the spatio-temporal dynamics of video concepts, and we propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experiments demonstrate the efficacy of our approach on multiple benchmark datasets.

Paper Structure

This paper contains 31 sections, 3 equations, 10 figures, 14 tables.

Figures (10)

  • Figure 1: Our task: recognize multiple classes in videos at inference from an open vocabulary, including entities (in blue) such as objects and scenes, and actions (in red). Prior zero-shot image classification approaches (e.g., CLIP) can label the salient entity in each frame (left), while zero-shot action recognition approaches (e.g., ViFi-CLIP) can classify the video-level action (middle). Our method (right) can recognize all action and entity classes present in the video.
  • Figure 2: Our open vocabulary classification method includes three stages of operation. During the (a) training stage, we train our label and video encoders on closed-set training labels. New class labels can be added to the vocabulary after training by employing the (b) classifier vocabulary expansion stage. During the (c) inference stage, video embeddings are computed and matched against the label embedding database to obtain the classification scores.
  • Figure 3: Our end-to-end trainable system for open vocabulary video classification. The class labels are used by the LLM to generate useful class attributes for the CLIP text encoder, which provides a visually aligned label embedding. The learnable input prompts to the LLM guide it to generate soft attributes useful for video classification. The prompting transformer learns to map from the LLM output space to the CLIP input space. To add video understanding to CLIP's vision encoder, we add spatio-temporal modeling layers. Details about each component are in the method section.
  • Figure 4: Our Temporal Block projects frame patch tokens from the CLIP image encoder and fuses them with the previous block's temporal branch tokens. The temporal token (TMP) incorporates all frames' CLS tokens. Divided Space-Time attention layers form the core of the block. Spatial attention layers are initialized from CLIP weights and regularized using stochastic weight averaging.
  • Figure 5: Our training objective applies binary cross-entropy to predicted Video-Label feature similarities sharpened by a temperature-scaled sigmoid.
  • ...and 5 more figures
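
The vocabulary expansion and inference stages described in the Figure 2 caption suggest a simple data flow: label embeddings are computed once, stored, and later matched against each video embedding. The sketch below is hypothetical and is not taken from the paper. `encode_label` stands in for the trained label encoder, and the class name `LabelBank`, the temperature 0.07, the 0.5 threshold, and the toy two-dimensional embeddings are all illustrative assumptions.

```python
import math

class LabelBank:
    """Stores label embeddings so new classes can be added after training."""

    def __init__(self, encode_label, tau=0.07):
        self.encode_label = encode_label  # hypothetical stand-in for the label encoder
        self.tau = tau                    # assumed temperature value
        self.embeddings = {}              # label text -> embedding (list of floats)

    def expand(self, labels):
        """Vocabulary expansion: embed and store labels not already present."""
        for label in labels:
            if label not in self.embeddings:
                self.embeddings[label] = self.encode_label(label)

    def classify(self, video_emb, threshold=0.5):
        """Return {label: probability} for labels at or above the threshold."""
        results = {}
        for label, emb in self.embeddings.items():
            s = sum(a * b for a, b in zip(emb, video_emb))
            z = s / self.tau
            if z >= 0:
                p = 1.0 / (1.0 + math.exp(-z))
            else:
                e = math.exp(z)
                p = e / (1.0 + e)
            if p >= threshold:
                results[label] = p
        return results

# Toy usage: a lookup table plays the role of the label encoder.
toy_encoder = {"dog": [1.0, 0.0], "running": [0.0, 1.0], "piano": [-1.0, 0.0]}
bank = LabelBank(toy_encoder.__getitem__)
bank.expand(["dog", "running"])   # labels available at first
bank.expand(["piano"])            # a new label added later, with no retraining
predictions = bank.classify([0.8, 0.6])
print(sorted(predictions))        # labels predicted as present in the toy video
```

Caching the label embeddings means that adding a class costs one pass through the label encoder, while per-video inference reduces to a dot product with each stored embedding.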