Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu

TL;DR

This work introduces ADU-Bench, a four-dataset benchmark (ADU-General, ADU-Skill, ADU-Multilingual, ADU-Ambiguity) that evaluates open-ended audio dialogue understanding in Large Audio-Language Models across general dialogue, specialized skills, multilingual contexts, and phonetic ambiguities. Ground-truth references are generated via GPT-4 or human annotation, and an LLM-based evaluation framework (with cross-checks for bias and human alignment) assesses 16 diverse LALMs, among which GPT-4o is a top performer. Key findings show substantial gaps in handling mathematical notation, roleplay, multilingual understanding, and phonetic ambiguity, underscoring the need for more robust audio dialogue capabilities. ADU-Bench offers a scalable platform of over 20,000 dialogues for the development and standardized evaluation of future audio dialogue systems.

Abstract

Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. This broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding is still missing. To address this gap, we propose the Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where a sentence with the same literal meaning expresses different intentions, e.g., "Really!?" with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.

Paper Structure

This paper contains 25 sections, 4 figures, 22 tables.

Figures (4)

  • Figure 1: ADU-Bench evaluates the open-ended audio dialogue understanding of LALMs, where users interact with LALMs directly through audio. ADU-Bench consists of 4 datasets: (a) the ADU-General dataset, (b) the ADU-Skill dataset, (c) the ADU-Multilingual dataset, and (d) the ADU-Ambiguity dataset. In total, it encompasses 20,715 open-ended audio dialogues, comprising over 8,000 real-world recordings alongside synthetic audio samples.
  • Figure 2: The average scores of 16 LALMs across each domain of the 4 datasets within ADU-Bench.
  • Figure 3: Ablation study on ADU-Bench. (a) Real-world and synthetic audio can both serve as evaluation sources. (b) The GPT-4 evaluator is aligned with human evaluation. (c) Scoring twice is necessary to eliminate position bias.
  • Figure 4: The evaluation method in ADU-Bench. To benchmark open-ended audio dialogue understanding for LALMs, we adopt a GPT-4 evaluator whose evaluation scores serve as the metric. We also adopt LLaMA-3-70B-Instruct and Qwen-2-72B-Instruct as evaluators to provide evaluation scores.
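
The "scoring twice" idea from Figure 3(c) can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt layout, the 0-10 scale, and the names `build_prompt`, `debiased_score`, and `judge` are all assumptions made for illustration. The sketch assumes the two passes differ only in whether the reference or the model response is shown first, and that the two scores are averaged.

```python
from statistics import mean
from typing import Callable

# Hypothetical judge interface: takes a full prompt, returns a numeric score.
# In practice this would wrap a call to an LLM evaluator; here it is left abstract.
Judge = Callable[[str], float]


def build_prompt(query: str, reference: str, response: str, response_first: bool) -> str:
    """Lay out the reference and the model response in one of two orders.

    The section labels and the 0-10 scale are illustrative assumptions.
    """
    blocks = [("Reference", reference), ("Response", response)]
    if response_first:
        blocks.reverse()
    body = "\n".join(f"[{name}]\n{text}" for name, text in blocks)
    return f"[Query]\n{query}\n{body}\nRate the Response on a 0-10 scale."


def debiased_score(judge: Judge, query: str, reference: str, response: str) -> float:
    """Score the same response twice, once per ordering, and average the results."""
    scores = [
        judge(build_prompt(query, reference, response, response_first))
        for response_first in (False, True)
    ]
    return mean(scores)
```

If a judge systematically favors whichever answer appears first, the two passes err in opposite directions, so averaging them cancels that position-dependent offset.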