Aligned Better, Listen Better for Audio-Visual Large Language Models

Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou

TL;DR

This work addresses the underutilization of audio in audio-visual large language models by introducing Dolphin, a fine-grained AV-LLM that aligns the audio and visual modalities in both space and time. It combines an audio-visual multi-scale adapter for spatial alignment with an audio-visual interleaved merging module for temporal alignment, and feeds the aligned tokens to an LLM for instruction-following tasks. Complementing the model, the AVU dataset provides 2.13M AV pairs and 5.24M Q&A pairs across diverse splits for training and evaluating AV understanding, supported by data filtering and meta-information integration. Experiments show that Dolphin achieves state-of-the-art or competitive results on zero-shot video QA, audio-centric tasks, and AV benchmarks while mitigating audio hallucinations. Overall, the work contributes both an architecture and a dataset toward more reliable audio-visual reasoning in LLMs.

Abstract

Audio is essential for multimodal video understanding. First, video inherently contains audio, which supplies information complementary to vision. Second, video large language models (Video-LLMs) encounter many audio-centric settings. However, existing Video-LLMs and Audio-Visual Large Language Models (AV-LLMs) exhibit deficiencies in exploiting audio information, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM, namely Dolphin. The concurrent alignment of audio and visual modalities in both temporal and spatial dimensions ensures a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter for multi-scale information aggregation, which achieves spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset, called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves remarkable performance in audio-visual understanding, but also mitigates potential hallucinations.
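
As a rough illustration of what temporal interleaving of audio and visual tokens could look like, the sketch below places the audio tokens of each time segment directly after the visual tokens of the same segment, so the language model sees both modalities in time order. The abstract does not specify how Dolphin implements audio-visual interleaved merging; the function name `interleave_av_tokens`, the tensor layout, and the per-segment ordering are all assumptions made for illustration only.

```python
import numpy as np


def interleave_av_tokens(visual: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Merge per-segment visual and audio tokens into one time-ordered sequence.

    Hypothetical helper, not the paper's implementation.
    visual: (T, Nv, D) array, Nv visual tokens for each of T time segments.
    audio:  (T, Na, D) array, Na audio tokens for the same T segments.
    Returns a (T * (Nv + Na), D) array ordered as
    [visual seg 0, audio seg 0, visual seg 1, audio seg 1, ...].
    """
    if visual.ndim != 3 or audio.ndim != 3:
        raise ValueError("expected arrays of shape (T, N, D)")
    if visual.shape[0] != audio.shape[0]:
        raise ValueError("visual and audio must cover the same number of segments")
    if visual.shape[2] != audio.shape[2]:
        raise ValueError("visual and audio tokens must share the embedding size D")

    # Within each segment, append the audio tokens after the visual tokens.
    merged = np.concatenate([visual, audio], axis=1)  # (T, Nv + Na, D)
    # Flatten the segment axis so segments follow each other in time order.
    return merged.reshape(-1, merged.shape[-1])


# Toy usage: 2 segments, 3 visual tokens and 1 audio token each, D = 4.
visual_tokens = np.zeros((2, 3, 4))
audio_tokens = np.ones((2, 1, 4))
sequence = interleave_av_tokens(visual_tokens, audio_tokens)
```

In this toy layout, rows 3 and 7 of `sequence` hold the audio tokens of segments 0 and 1. A plain concatenation of all visual tokens followed by all audio tokens would instead separate each sound from the frames it accompanies, which is the kind of temporal misalignment the abstract sets out to avoid.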

Paper Structure

This paper contains 60 sections, 6 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: (a) Audio-visual capability of previous AV-LLMs and our Dolphin. We pose questions separately for audio-video and audio inputs and find that VideoLLaMA and VideoLLaMA 2 exhibit significant hallucinations in audio understanding, while Dolphin produces accurate responses. (b) Audio can provide information complementary to video. Incorporating audio into training greatly enhances video understanding.
  • Figure 2: Overview of our Dolphin, which aligns on both spatial and temporal dimensions to fully exploit the natural consistency of videos and enhance the complementary roles of vision and hearing. Specifically, for spatial alignment, we introduce an audio-visual multi-scale adapter using a dual-feature pathway design, which extracts multi-scale features from both visual and auditory inputs and achieves fine-grained alignment with the respective modality.
  • Figure 3: The integration pipeline of the audio-visual understanding dataset (AVU-dataset).
  • Figure 4: Performance comparison of different task-specific experts.
  • Figure 5: Examples of prompt templates for generating the AVU-dataset; others are in the appendix.
  • ...and 6 more figures