Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper

TL;DR

AVVA tackles the reliance on textual mediation in audiovisual foundation models by proposing a text-free AV learning framework that builds on Whisper and DINOv2, guided by a data-curation pipeline based on a multimodal reasoning engine (MRE). It combines a bidirectional cross-modal attention backbone with a contrastive objective and an LLM-driven five-metric scoring system to curate high-quality AV pairs. The results show competitive audio-to-video retrieval and substantially improved video-to-audio retrieval with only $192$ hours of curated data, roughly 30x less training data than the $5{,}800$ hours used by DenseAV. These findings underscore the value of data quality over quantity in multimodal learning and point to a scalable path toward text-free AV foundation models for real-world video understanding and interaction.

Abstract

Integrating audio and visual data for training multimodal foundation models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and by leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hours of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quantity for data quality, yielding increases in top-k retrieval accuracies on AudioCaps, VALOR, and VGGSound, compared to training on the full spectrum of uncurated data.
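
To make the retrieval metric concrete, the following is a minimal sketch of how top-k accuracy could be computed from paired audio and video embeddings. It assumes that the audio and video items at the same index form a ground-truth pair and that similarity is measured by cosine similarity; the function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_accuracy(queries, gallery, k):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, g) for g in gallery]
        # Gallery indices sorted from most to least similar to the query.
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy embeddings (hypothetical): index i of `video` is paired with index i of `audio`.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.8]]

v2a_top1 = top_k_accuracy(video, audio, k=1)  # video-to-audio retrieval
a2v_top1 = top_k_accuracy(audio, video, k=1)  # audio-to-video retrieval
a2v_top2 = top_k_accuracy(audio, video, k=2)
print(v2a_top1, a2v_top1, a2v_top2)
```

In this toy case the two retrieval directions give different top-1 accuracies, which is why audio-to-video and video-to-audio results are reported separately.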

Paper Structure

This paper contains 9 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the proposed audiovisual alignment approach. (a) Our method's data curation stage uses multimodal reasoning to retain only highly aligned data. It uses LLaVA-Next-Video for video reasoning, LTU-AS for audio processing, and Mistral for alignment scoring. (b) AVVA employs Whisper as the audio backbone and DINOv2 as the video backbone, without the need for textual mediation during training.
  • Figure 2: The architecture of the MRE. The design feeds the outputs of an audio-LLM and a video-LLM into a Mistral LLM, which reasons over audiovisual scene alignment by combining 5 alignment scores calculated on the AV pairs.
  • Figure 3: The AVVA model training. Audio (Whisper) and video (DINOv2) encoders process raw inputs, which are aligned via learnable parameters in aligner layers. A bidirectional cross-modal attention module captures the interaction between audio and video features, which are pooled to generate final embeddings for contrastive learning.
  • Figure 4: Audio-to-video model performance over hours of training data, where the amount of training data is set by varying the MRE score threshold, shown for Top-$k=\{1,3,10\}$ accuracies.
  • Figure 5: Cosine similarity between AVVA embeddings as a function of audio shifts and video shifts. The data points show mean similarity scores at each shift level. Audio embeddings are compared against shifted audio (panel 1) and shifted video (panel 4), and video embeddings are compared against shifted video (panel 2) and shifted audio (panel 3).
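
The threshold-based curation described in the captions for Figures 2 and 4 can be illustrated with a short sketch. It assumes that each segment carries five alignment scores, that they are aggregated by a plain mean, and that a segment is kept when this aggregate meets the threshold. The aggregation rule, the 0 to 10 score scale, the `Segment` fields, and the sample values are all hypothetical, since this summary does not specify them.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Segment:
    clip_id: str
    duration_s: float
    scores: tuple  # five alignment scores (hypothetical 0-10 scale)

def curate(segments, threshold):
    """Keep segments whose mean score meets the threshold.

    Returns the kept segments and their total duration in hours.
    """
    kept = [s for s in segments if mean(s.scores) >= threshold]
    hours = sum(s.duration_s for s in kept) / 3600.0
    return kept, hours

# Toy data (hypothetical): three segments with different alignment quality.
segments = [
    Segment("a", 1800.0, (9, 8, 9, 7, 9)),  # mean 8.4
    Segment("b", 1800.0, (5, 6, 4, 5, 5)),  # mean 5.0
    Segment("c", 3600.0, (7, 7, 8, 6, 7)),  # mean 7.0
]

# Raising the threshold shrinks the retained training set.
for threshold in (0, 6, 8):
    kept, hours = curate(segments, threshold)
    print(threshold, [s.clip_id for s in kept], hours)
```

Sweeping the threshold in this way yields training sets of different sizes, which is the kind of hours-versus-accuracy trade-off that Figure 4 plots.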