From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs
Yuhang Jia, Xu Zhang, Yujie Guo, Yang Chen, Shiwan Zhao
TL;DR
The paper addresses the semantic mismatch introduced by Audio Difference Captioning (ADC) in cross-modal audio-text learning by proposing Audio Commonality Captioning (ACC), which emphasizes shared semantics across paired audio clips. ACC is implemented via a dataset construction strategy that mixes AudioCaps with AuditEval to produce paired audio inputs and commonality-focused captions, and is used to fine-tune a Qwen2-Audio-7B multimodal LLM with LoRA-based instruction tuning. Experimental results on AudioCaps and Clotho show ACC achieving state-of-the-art captioning metrics and better generalization to speech and music tasks (VSC, SER, MIC, MGC) compared to AC and ADC, indicating stronger cross-modal alignment with preserved prior capabilities. These findings suggest ACC as a robust training objective for enhancing audio-text understanding in multimodal LLMs, balancing generalization and task-specific performance in practical applications.
Abstract
Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of Multimodal LLMs (MLLMs). To strengthen this alignment, recent works propose Audio Difference Captioning (ADC), which takes multiple audio inputs and encourages the model to describe their differences, thereby promoting fine-grained discrimination. However, despite its effectiveness, ADC introduces a semantic gap between the input audio clips, which are often rich in diverse events, and the brief, difference-focused caption. This deviation from the AC-style task causes a mismatch with the pretraining objective, leading to catastrophic forgetting. To address this, we propose Audio Commonality Captioning (ACC), a comparably challenging but gentler alternative that guides the model to capture shared semantics across audio clips rather than detailed differences. Experiments show that ACC not only improves audio-text understanding on captioning benchmarks but also better preserves general capabilities across diverse speech and music tasks, confirming its ability to enable more robust cross-modal understanding and achieve a better balance between generalization and task-specific performance in MLLMs.
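The contrast between the two objectives can be illustrated with a toy sketch. It assumes each clip is annotated with a set of sound-event tags, which is a simplification: the work itself deals with free-text captions, and the function names and example tags below are hypothetical, chosen only for illustration. Under that assumption, a commonality-style target keeps the events shared by both clips, while a difference-style target keeps the events found in only one of them.

```python
# Toy illustration only: event tags stand in for free-text captions (assumption).

def commonality_target(events_a, events_b):
    """Events present in both clips (an ACC-style target under the tag assumption)."""
    return sorted(set(events_a) & set(events_b))


def difference_target(events_a, events_b):
    """Events present in exactly one clip (an ADC-style target under the tag assumption)."""
    return sorted(set(events_a) ^ set(events_b))


# Hypothetical event annotations for two audio clips.
clip_a = ["dog barking", "rain", "car passing"]
clip_b = ["rain", "car passing", "people talking"]

common = commonality_target(clip_a, clip_b)
diff = difference_target(clip_a, clip_b)

print("commonality:", common)
print("difference:", diff)
```

In this toy case the commonality target still names events that each clip's own caption would contain, whereas the difference target describes neither clip as a whole. That is one way to read the claim that ACC stays closer to the AC-style objective than ADC does.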
