Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models
Rui Hu, Delai Qiu, Shuyu Wei, Jiaming Zhang, Yining Wang, Shengping Liu, Jitao Sang
TL;DR
This work addresses the gap between vision-text (VL) and vision-audio (VA) capabilities in omnimodal large language models (OLLMs), showing that vision-audio integration is hindered by the weaker vision-audio alignment learned during training. It proposes Self-Knowledge Distillation (Self-KD), in which the vision-text pathway acts as a teacher that guides the vision-audio pathway through a KL-divergence loss. The overall objective combines the two losses as $L = \alpha L_{\text{Self-KD}} + (1-\alpha) L_{\text{SFT}}$. Across multiple base models and VA benchmarks, Self-KD narrows the VL-VA gap and makes VA attention patterns and MMAlign performance more VL-like, though VA still trails VL in absolute terms. The results point to a practical, model-agnostic way to improve audio-visual interaction in OLLMs, with implications for building more robust multimodal agents that handle audio queries as effectively as text queries.
Abstract
Omnimodal Large Language Models (OLLMs) have shown significant progress in integrating vision and text, but still struggle with integrating vision and audio, often exhibiting suboptimal performance when processing audio queries compared to text queries. This disparity is primarily due to insufficient alignment between vision and audio modalities during training, leading to inadequate attention to visual information when using audio queries. To mitigate this issue, we propose a Self-Knowledge Distillation (Self-KD) training method where the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student. This enables the model to process audio in a manner analogous to its text processing. Our experimental results demonstrate that Self-KD is an effective method for enhancing the vision-audio capabilities of OLLMs by learning from the vision-text components, which subsequently improves the interaction between audio and images and results in improved performance on multimodal tasks.
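The combined objective $L = \alpha L_{Self-KD} + (1-\alpha) L_{SFT}$ can be illustrated with a minimal sketch for a single output token. This is not the authors' implementation. It assumes the teacher distribution comes from the vision-text pathway and the student distribution from the vision-audio pathway of the same model, that the KL term is taken as KL(teacher || student), and that the SFT term is a standard cross-entropy against the ground-truth token. The function names, the KL direction, and the default value of alpha are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, target_index, eps=1e-12):
    """Negative log-likelihood of the ground-truth token under `probs`."""
    return -math.log(probs[target_index] + eps)


def self_kd_objective(teacher_logits, student_logits, target_index, alpha=0.5):
    """Sketch of L = alpha * L_SelfKD + (1 - alpha) * L_SFT for one token.

    teacher_logits: logits from the vision-text pathway (assumed to be treated
                    as a fixed target; this sketch computes no gradients).
    student_logits: logits from the vision-audio pathway.
    target_index:   index of the ground-truth token (assumed supervision).
    alpha:          mixing weight; 0.5 is an arbitrary illustrative default.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    l_self_kd = kl_divergence(teacher, student)  # assumed direction: teacher || student
    l_sft = cross_entropy(student, target_index)
    return alpha * l_self_kd + (1 - alpha) * l_sft


if __name__ == "__main__":
    # Toy 3-token vocabulary: the audio-query pathway is less confident
    # than the text-query pathway about the correct token (index 2).
    text_query_logits = [0.1, 0.2, 3.0]
    audio_query_logits = [1.0, 1.0, 1.5]
    print(self_kd_objective(text_query_logits, audio_query_logits, target_index=2))
```

When the two pathways produce identical distributions the distillation term vanishes and only the SFT term remains, which matches the intuition that the student is pushed to answer audio queries the way the model already answers text queries.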
