Table of Contents
Fetching ...

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo, Ping-Le Tsai, Yen-Ting Piao, Hung-Wei Chen, Ting-Lin Hsiao, Yun-Man Hsu, Ke-Han Lu, Hung-yi Lee

TL;DR

MUGEN is introduced, a comprehensive benchmark evaluating this capability across speech, general audio, and music, and it is observed that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains.

Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

TL;DR

MUGEN is introduced, a comprehensive benchmark evaluating this capability across speech, general audio, and music, and it is observed that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains.

Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought further improves performance to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
Paper Structure (17 sections, 3 figures, 3 tables)

This paper contains 17 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of MUGEN and the detailed task distribution across the seven evaluation dimensions.
  • Figure 2: Illustration of the audio-as-option design.
  • Figure 3: Performance scaling under varying numbers of audio candidates for tasks without (left) and with reference audio (right).