Table of Contents
Fetching ...

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li

TL;DR

The first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios is proposed, demonstrating that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations.

Abstract

Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

TL;DR

The first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios is proposed, demonstrating that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations.

Abstract

Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.
Paper Structure (16 sections, 3 figures, 9 tables)

This paper contains 16 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Motivating Example - Qwen-Audio's responses for single audio inputs and a two-audio input task.
  • Figure 2: Overview of proposed MAE benchmark. Yellow blocks show the speech tasks, while grey blocks show the sound tasks. Speech transcriptions, audio labels, input instructions, and expected output responses are given.
  • Figure 3: MALLM training framework: speech/sound pair synthesis, and unified synthetic discriminative training.