Can Large Language Models Understand Spatial Audio?

Changli Tang; Wenyi Yu; Guangzhi Sun; Xianzhao Chen; Tian Tan; Wei Li; Jun Zhang; Lu Lu; Zejun Ma; Yuxuan Wang; Chao Zhang

Abstract

This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study three spatial audio tasks: sound source localisation (SSL), far-field speech recognition (FSR), and localisation-informed speech extraction (LSE), achieving notable progress in each task. For SSL, our approach achieves a mean absolute error (MAE) of $2.70^{\circ}$ on the Spatial LibriSpeech dataset, substantially surpassing the prior benchmark of about $6.60^{\circ}$. Moreover, our model can employ spatial cues to improve FSR accuracy and execute LSE by selectively attending to sounds originating from a specified direction via text prompts, even amidst overlapping speech. These findings highlight the potential of adapting LLMs to grasp physical audio concepts, paving the way for LLM-based agents in 3D environments.
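To make the SSL metric concrete, the following is a minimal sketch of how a mean angular error in degrees could be computed from predicted and reference directions. It assumes each direction is given as an (azimuth, elevation) pair in degrees and measures error as the great-circle angle between the two directions; whether the paper uses exactly this definition is an assumption, and the function names are illustrative only.

```python
import math


def direction_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert (azimuth, elevation) in degrees to a 3D unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (
        math.cos(el) * math.cos(az),
        math.cos(el) * math.sin(az),
        math.sin(el),
    )


def angular_error_deg(pred, ref):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    u = direction_to_unit_vector(*pred)
    v = direction_to_unit_vector(*ref)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


def mean_angular_error_deg(preds, refs):
    """Average angular error over paired predictions and references."""
    if len(preds) != len(refs) or not preds:
        raise ValueError("preds and refs must be non-empty and equal in length")
    total = sum(angular_error_deg(p, r) for p, r in zip(preds, refs))
    return total / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions and references, not data from the paper.
    preds = [(350.0, 0.0), (45.0, 10.0)]
    refs = [(10.0, 0.0), (45.0, 10.0)]
    print(mean_angular_error_deg(preds, refs))
```

Measuring the angle between unit vectors handles azimuth wrap-around naturally: a prediction of 350° against a reference of 10° counts as a 20° error, not 340°.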

Paper Structure (18 sections, 5 equations, 2 figures, 3 tables)

Introduction
Related Work
On Auditory LLMs
On 3D SSL
On Spatial FSR
Methods
Model Structure
Training Strategy
The DSS LibriSpeech Dataset
Experimental Setup
Model Specifications
Data Specifications
Task Specifications
Experimental Results
3D Sound Source Localisation (SSL)
...and 3 more sections

Figures (2)

Figure 1: The model structure is shown above. There are two options for introducing spatial information: adding intensity vectors either before or after the Q-Former. Optionally, numbers can also be added to the LLM vocabulary as special tokens (ST).
Figure 2: Results on the left/right dataset with overlapping ratios from 0% to 70%. The dotted lines show the performance on the "fully" overlapped test set, generated by activating the two sources simultaneously (as described in the paper's footnote).
