Aligned Better, Listen Better for Audio-Visual Large Language Models
Yuxin Guo, Shuailei Ma, Shijie Ma, Xiaoyi Bao, Chen-Wei Xie, Kecheng Zheng, Tingyu Weng, Siyang Sun, Yun Zheng, Wei Zou
TL;DR
This work addresses the underutilization of audio in audio-visual large language models (AV-LLMs) by introducing Dolphin, a fine-grained AV-LLM that aligns the audio and visual modalities in both space and time. It combines an audio-visual multi-scale adapter for spatial alignment with an audio-visual interleaved merging module for temporal alignment, and feeds the aligned features to an LLM for instruction-following tasks. Complementing the model, the AVU dataset provides 2.13M audio-visual pairs and 5.24M question-answer pairs across diverse splits for training and evaluating audio-visual understanding, built with data filtering and meta-information integration. Experiments show that Dolphin achieves state-of-the-art or competitive results on zero-shot video QA, audio-centric tasks, and audio-visual benchmarks while mitigating audio hallucinations. The work thus contributes both an architecture and a dataset toward more reliable audio-visual reasoning in LLMs.
Abstract
Audio is essential for multimodal video understanding. Video inherently contains audio, which supplies information complementary to vision, and video large language models (Video-LLMs) encounter many audio-centric settings. However, existing Video-LLMs and audio-visual large language models (AV-LLMs) exploit audio information poorly, leading to weak understanding and hallucinations. To address these issues, we examine both the model architecture and the dataset. (1) From the architectural perspective, we propose a fine-grained AV-LLM named Dolphin. Aligning the audio and visual modalities concurrently in both the temporal and spatial dimensions supports a comprehensive and accurate understanding of videos. Specifically, we devise an audio-visual multi-scale adapter that aggregates multi-scale information to achieve spatial alignment. For temporal alignment, we propose audio-visual interleaved merging. (2) From the dataset perspective, we curate an audio-visual caption and instruction-tuning dataset called AVU. It comprises 5.2 million diverse, open-ended data tuples (video, audio, question, answer) and introduces a novel data partitioning strategy. Extensive experiments show that our model not only achieves strong performance in audio-visual understanding but also mitigates potential hallucinations.
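The temporal alignment idea named in the abstract, audio-visual interleaved merging, can be illustrated with a minimal Python sketch. The abstract does not describe the mechanism, so this is only one plausible reading: it assumes the video and audio have already been split into the same number of time segments, and that each segment's visual tokens are followed by that segment's audio tokens. The function name `interleave_by_time` and the string tokens are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def interleave_by_time(
    visual_segments: Sequence[Sequence[str]],
    audio_segments: Sequence[Sequence[str]],
) -> List[str]:
    """Merge per-segment visual and audio tokens into one time-ordered sequence.

    Assumption (not from the paper): both inputs cover the same time segments,
    and within each segment the visual tokens precede the audio tokens.
    """
    if len(visual_segments) != len(audio_segments):
        raise ValueError("visual and audio inputs must have the same number of segments")

    merged: List[str] = []
    for visual_tokens, audio_tokens in zip(visual_segments, audio_segments):
        # Tokens that share a time segment stay adjacent in the merged sequence.
        merged.extend(visual_tokens)
        merged.extend(audio_tokens)
    return merged


# Two time segments with a different number of tokens per modality.
visual = [["v0_a", "v0_b"], ["v1_a"]]
audio = [["a0_a"], ["a1_a", "a1_b"]]
merged_sequence = interleave_by_time(visual, audio)
```

In this sketch, tokens from the same time segment sit next to each other in the sequence passed to the LLM, in contrast to concatenating all visual tokens followed by all audio tokens.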
