Audio-visual training for improved grounding in video-text LLMs

Shivprasad Sagare, Hemachandran S, Kinshuk Sarabhai, Prashant Ullegaddi, Rajeshkumar SA

TL;DR

This work investigates the role of audio in grounding video-language understanding by proposing a dual-branch video-text MLLM with separate audio (Whisper) and visual (sigLIP) encoders that feed into a compact LLM backbone (phi-2). Trained with both audio and visual signals on a video instruction-tuning dataset, the model shows improved grounding compared to a vision-only baseline and a contemporary audio-visual model across multiple evaluation frameworks. In addition, the authors release a new audio-aware benchmark of 120 QA samples that require audio cues, enabling more robust assessment of audio-visual grounding. The results support the value of explicit audio training for video understanding and motivate further exploration of audio-visual fusion strategies and richer benchmarks for video tasks.

Abstract

Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparison with vision-only baselines and other audio-visual models shows that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.

Paper Structure

This paper contains 7 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An example of improved grounding in the video-text LLM outputs, due to the additional audio signal as input.
  • Figure 2: Tensor dimensions in the figure denote the flow of data through the encoder and projector layers. The audio encoder (Whisper) and the video encoder (using sigLIP) produce 64 and 829 token embeddings respectively, which are then concatenated with the text token embeddings to form the final input to the LLM. Unlike previous works, we train both the audio and vision branches simultaneously using a video instruction-tuning dataset.
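
The token flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the token counts (64 audio, 829 video) come from the caption. The feature dimensions, the single linear projector per branch, the audio-video-text ordering, and the helper names `project` and `build_llm_input` are all assumptions made for the example.

```python
import random

# Token counts taken from the Figure 2 caption.
NUM_AUDIO_TOKENS = 64
NUM_VIDEO_TOKENS = 829

# Hypothetical sizes chosen only to keep the example small (not from the paper).
AUDIO_FEATURE_DIM = 32
VIDEO_FEATURE_DIM = 48
LLM_HIDDEN_DIM = 16
NUM_TEXT_TOKENS = 10


def random_matrix(rows, cols, rng):
    """Stand-in for encoder outputs or projector weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weight):
    """Linear projection: (tokens, in_dim) x (in_dim, out_dim) -> (tokens, out_dim)."""
    columns = list(zip(*weight))
    return [[sum(f * w for f, w in zip(row, col)) for col in columns]
            for row in features]


def build_llm_input(audio_features, video_features, text_embeddings,
                    w_audio, w_video):
    """Project each modality into the LLM embedding space, then concatenate
    along the token axis. The audio, video, text ordering is an assumption."""
    audio_tokens = project(audio_features, w_audio)
    video_tokens = project(video_features, w_video)
    return audio_tokens + video_tokens + text_embeddings


rng = random.Random(0)
audio_features = random_matrix(NUM_AUDIO_TOKENS, AUDIO_FEATURE_DIM, rng)
video_features = random_matrix(NUM_VIDEO_TOKENS, VIDEO_FEATURE_DIM, rng)
text_embeddings = random_matrix(NUM_TEXT_TOKENS, LLM_HIDDEN_DIM, rng)
w_audio = random_matrix(AUDIO_FEATURE_DIM, LLM_HIDDEN_DIM, rng)
w_video = random_matrix(VIDEO_FEATURE_DIM, LLM_HIDDEN_DIM, rng)

llm_input = build_llm_input(audio_features, video_features, text_embeddings,
                            w_audio, w_video)
print(len(llm_input), len(llm_input[0]))  # prints "903 16"
```

With 10 text tokens, the LLM sees 64 + 829 + 10 = 903 token embeddings, so the audio branch adds only a small fraction of the sequence length relative to the video branch.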