CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection

Andrea Appiani, Cigdem Beyan

TL;DR

This study introduces a novel voice activity detection approach built on Contrastive Language-Image Pretraining (CLIP) models. Despite its simplicity, it outperforms several audio-visual methods without requiring pre-training on extensive audio-visual datasets.

Abstract

Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in vision-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments showing the upper body of an individual, while the text encoder handles textual descriptions automatically generated through prompt engineering. Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks shows that our method outperforms existing visual VAD approaches. Notably, our approach also outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The overview of the proposed approach: CLIP-VAD. Our approach takes short video segments capturing individuals' upper body, along with textual descriptions of their speaking status, derived through prompt engineering. The goal is to harness both video and text embeddings to enable a fusion network to determine whether the person depicted is speaking or not. The input video segment consists of 10 frames, each represented as an embedding of size 10$\times$512 after being input to the CLIP visual encoder. These 10-frame embeddings, which are 10$\times$10$\times$512, are averaged along the temporal dimension. The central frame of these 10 frames, together with a prompt, is input to the LLaVA model [Llava_paper1, Llava_paper2] to generate a textual response (caption). The caption is then provided to the CLIP text encoder, resulting in a single text embedding. This text embedding is replicated 10 times and concatenated with the 10$\times$512 video embeddings to form the input to a Fusion Model, designed as either an MLP or a Transformer network, which predicts the VAD label.
  • Figure 2: The extraction of visual embeddings using the CLIP visual encoder. The encoder takes a frame consisting of an upper body crop of an individual along with the 9 non-overlapping patches obtained from that frame. This process captures both local and global features. From these 10 inputs, we obtain a frame embedding of size 10$\times$512 (1 upper body embedding + 9 patch-wise embeddings). This procedure is repeated for all frames within a given video segment.
  • Figure 3: Example frames from the datasets used in this paper. On the left are the Columbia and Modified Columbia datasets, and on the right is the RealVAD dataset. Green boxes indicate the active speakers, while red boxes denote other participants with VAD ground-truth, who are not speaking at that moment.
  • Figure 4: Example text responses obtained using LLaVA [Llava_paper1, Llava_paper2] with the second prompt: "Is the person speaking? Explain why in a few words." The green texts are the cases where the VAD class is correctly predicted, while the red texts signify instances of incorrect predictions.
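
The tensor shapes described in the captions of Figures 1 and 2 can be traced with a small sketch. This is an illustration only and not the authors' implementation. `fake_encoder` is a hypothetical stand-in that returns random 512-dimensional vectors in place of the CLIP visual and text encoders. The 3$\times$3 patch grid, the 225$\times$225 toy frame size, and the choice to concatenate the replicated text embedding along the feature axis (giving 10$\times$1024) are assumptions, since the captions do not spell them out.

```python
import random

NUM_FRAMES = 10   # frames per video segment (Figure 1)
GRID = 3          # assumed 3 x 3 layout for the 9 non-overlapping patches
EMBED_DIM = 512   # embedding size stated in the captions


def split_into_inputs(frame):
    """Return the full upper-body crop followed by its 9 non-overlapping patches."""
    height, width = len(frame), len(frame[0])
    ph, pw = height // GRID, width // GRID
    patches = [
        [row[c * pw:(c + 1) * pw] for row in frame[r * ph:(r + 1) * ph]]
        for r in range(GRID)
        for c in range(GRID)
    ]
    return [frame] + patches  # 1 global input + 9 local inputs


def fake_encoder(_item, rng):
    """Hypothetical stand-in for a CLIP encoder: a random 512-d vector."""
    return [rng.random() for _ in range(EMBED_DIM)]


def build_fusion_input(frames, caption, rng):
    # Each frame yields 10 inputs, so per_frame is 10 frames x 10 inputs x 512.
    per_frame = [
        [fake_encoder(x, rng) for x in split_into_inputs(frame)]
        for frame in frames
    ]
    n = len(per_frame)
    # Average over the temporal axis: 10 x 10 x 512 -> 10 x 512.
    video = [
        [sum(per_frame[t][i][d] for t in range(n)) / n for d in range(EMBED_DIM)]
        for i in range(len(per_frame[0]))
    ]
    # One text embedding per caption, reused for each of the 10 rows.
    text = fake_encoder(caption, rng)
    # Assumed feature-axis concatenation: 10 x 512 and 10 x 512 -> 10 x 1024.
    return [row + text for row in video]


rng = random.Random(0)
# Toy frames filled with zeros; real inputs would be upper-body crops.
frames = [[[0] * 225 for _ in range(225)] for _ in range(NUM_FRAMES)]
fusion_input = build_fusion_input(frames, "The person is speaking.", rng)
print(len(fusion_input), len(fusion_input[0]))  # prints: 10 1024
```

Under these assumptions, the resulting 10$\times$1024 array is what would be handed to the fusion model (MLP or Transformer). If the text embeddings were instead stacked along the sequence axis, the input would be 20$\times$512.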