MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Chih-Kai Yang; Yun-Shao Tsai; Yu-Kai Guo; Ping-Le Tsai; Yen-Ting Piao; Hung-Wei Chen; Ting-Lin Hsiao; Yun-Man Hsu; Ke-Han Lu; Hung-yi Lee

TL;DR

MUGEN is a comprehensive benchmark that evaluates the multi-audio understanding of large audio-language models across speech, general audio, and music. Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains.

Abstract

While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-audio settings, and performance degrades sharply as the number of concurrent audio inputs increases, identifying input scaling as a fundamental bottleneck. We further investigate training-free strategies and observe that Audio-Permutational Self-Consistency, which diversifies the order of audio candidates, helps models form more robust aggregated predictions, yielding up to 6.28% accuracy gains. Combining this permutation strategy with Chain-of-Thought raises the accuracy gain further, to 6.74%. These results expose blind spots in current LALMs and provide a foundation for evaluating complex auditory comprehension.
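The permutation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says that the order of audio candidates is diversified and that predictions are aggregated, so the majority vote, the function name `permutational_self_consistency`, the `predict` callable (a stand-in for a LALM query that returns the position of the chosen candidate), and the default of five permutations are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def permutational_self_consistency(
    candidates: Sequence[str],
    predict: Callable[[List[str]], int],
    num_permutations: int = 5,
    seed: int = 0,
) -> int:
    """Query a model on shuffled candidate orders and majority-vote the answers.

    `candidates` are audio options (for example file paths).
    `predict` is a hypothetical model wrapper: it receives the candidates in
    the presented order and returns the position it selects in that order.
    The return value is an index into the ORIGINAL `candidates` order.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(num_permutations):
        # Draw a random presentation order over the original indices.
        order = list(range(len(candidates)))
        rng.shuffle(order)
        shuffled = [candidates[i] for i in order]

        # The model answers with a position in the shuffled list;
        # map it back to the original index before counting the vote.
        picked_position = predict(shuffled)
        votes[order[picked_position]] += 1

    # Assumed aggregation: simple majority vote. Ties resolve to the
    # candidate that received its first vote earliest.
    return votes.most_common(1)[0][0]
```

Mapping each answer back to the original index is what lets votes from different orderings be compared; a model that tends to favor a fixed position spreads those biased votes across different candidates, while a content-based choice keeps landing on the same one.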

Paper Structure (17 sections, 3 figures, 3 tables)


Introduction
MUGEN Benchmark
Overview
Evaluation Dimensions
Data Sources
Experimental Setups
Baselines
Evaluation Metric
Evaluation Results
Main Results
Performance Scaling with the Number of Audio Inputs
Improvement Strategies
Methodology
Results
Conclusion
...and 2 more sections

Figures (3)

Figure 1: Overview of MUGEN and the detailed task distribution across the seven evaluation dimensions.
Figure 2: Illustration of the audio-as-option design.
Figure 3: Performance scaling under varying numbers of audio candidates for tasks without (left) and with reference audio (right).
