Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI

Che Liu; Changde Du; Xiaoyu Chen; Huiguang He

TL;DR

This paper tackles brain-to-audio reconstruction from noninvasive fMRI by reversing the brain's auditory processing hierarchy. It introduces a coarse-to-fine pipeline that first maps fMRI into the CLAP semantic space (512-dimensional) and then into the AudioMAE latent space (768-dimensional), with audio synthesized by a Latent Diffusion Model and a vocoder. Across Brain2Sound, Brain2Music, and Brain2Speech, the approach achieves state-of-the-art performance on metrics such as FD, FAD, KL, and KL-S, and demonstrates that semantic prompts can improve audio quality when semantic decoding is suboptimal. The framework offers a scalable, universal brain-to-audio solution with potential implications for neural decoding and brain-computer interfaces, while also highlighting dataset-dependent semantic guidance effects and future directions for refinement.

Abstract

Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Using non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach follows the inverse pathway of auditory processing. We first decode fMRI data coarsely into the low-dimensional semantic space of CLAP, and then perform fine-grained decoding into the high-dimensional AudioMAE latent space, guided by the semantic features. These fine-grained neural features serve as conditions for audio reconstruction through a Latent Diffusion Model (LDM). Validation on three public fMRI datasets (Brain2Sound, Brain2Music, and Brain2Speech) shows that our coarse-to-fine decoding method outperforms stand-alone fine-grained approaches and achieves state-of-the-art performance on metrics such as FD, FAD, and KL. Moreover, by employing semantic prompts during decoding, we enhance the quality of reconstructed audio when the decoded semantic features are suboptimal. The demonstrated versatility of our model across diverse stimuli highlights its potential as a universal brain-to-audio framework. This research contributes to the understanding of the human auditory system and advances neural decoding and audio reconstruction methodologies.
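
The two-stage data flow described in the abstract can be illustrated with a minimal shape-level sketch. It is not the authors' architecture: the linear decoders, the random weights, the voxel count `N_VOXELS`, and the use of concatenation as the form of semantic guidance are all assumptions made for illustration, and the AudioMAE latent is treated as a single 768-dimensional vector. Only the 512 and 768 dimensions come from the text above.

```python
import random

SEMANTIC_DIM = 512   # CLAP semantic space size stated in the TL;DR
ACOUSTIC_DIM = 768   # AudioMAE latent size stated in the TL;DR
N_VOXELS = 64        # hypothetical voxel count, chosen only for illustration


def random_linear(in_dim, out_dim, rng):
    """Return an (out_dim x in_dim) weight matrix standing in for a trained decoder."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply a weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def coarse_to_fine_decode(fmri, w_semantic, w_acoustic):
    # Stage 1 (coarse): fMRI -> 512-dim semantic estimate.
    semantic = apply_linear(w_semantic, fmri)
    # Stage 2 (fine): fMRI concatenated with the semantic estimate
    # -> 768-dim acoustic estimate (concatenation is an assumed form of guidance).
    acoustic = apply_linear(w_acoustic, fmri + semantic)
    return semantic, acoustic


rng = random.Random(0)
w_semantic = random_linear(N_VOXELS, SEMANTIC_DIM, rng)
w_acoustic = random_linear(N_VOXELS + SEMANTIC_DIM, ACOUSTIC_DIM, rng)
fmri = [rng.gauss(0.0, 1.0) for _ in range(N_VOXELS)]

semantic, acoustic = coarse_to_fine_decode(fmri, w_semantic, w_acoustic)
print(len(semantic), len(acoustic))  # prints "512 768"
```

In the pipeline described above, the 768-dimensional output of the second stage would then condition the LDM; that generative step is omitted here.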

Paper Structure (25 sections, 4 equations, 10 figures, 3 tables)

Introduction
Method
Coarse-to-fine brain decoding
Coarse-grained semantic decoding
Fine-grained acoustic decoding
Brain-to-audio reconstruction
Conditional reconstruction
Experiments
Datasets
Metrics
Reconstruction results
Semantic analysis of acoustic features
Conditional reconstruction results
Conclusion
Appendix
...and 10 more sections

Figures (10)

Figure 1: (a) The hierarchical auditory processing pathway of humans. The stimulus audio is gradually decomposed into time-frequency representation, low-level acoustic features, and high-level semantic characteristics. (b) The pipeline for our coarse-to-fine reconstruction from fMRI. Brain activity is decoded progressively into semantic, acoustic, and spectrogram levels, ultimately resulting in reconstructed audio.
Figure 2: (a) Coarse-to-fine brain decoding. In the coarse-grained decoding, fMRI is decoded into the semantic space of CLAP. In the fine-grained decoding, fMRI is decoded into the acoustic space of AudioMAE. (b) Detailed structure of Acoustic Decoder.
Figure 3: Brain-to-audio reconstruction. The LDM generates mel-spectrograms conditioned on the fine-grained acoustic features, and the vocoder then converts them into reconstructed audio.
Figure 4: Reconstruction results for subjects S1, sub-001, and UTS01 on the three datasets.
Figure 5: PCC between the ground-truth and decoded acoustic features for 17 subjects in the Brain2Sound, Brain2Music, and Brain2Speech datasets. Our coarse-to-fine method consistently outperforms the direct fine-grained method.
...and 5 more figures
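
The PCC in the Figure 5 caption is presumably the Pearson correlation coefficient between a ground-truth feature vector and its decoded counterpart. A minimal sketch of that computation follows; the function name `pearson_cc` and the toy vectors are hypothetical, and how the paper aggregates the score over samples and feature dimensions is not stated in this excerpt.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy vectors standing in for ground-truth and decoded acoustic features.
ground_truth = [0.1, 0.4, 0.35, 0.8]
decoded = [0.2, 0.5, 0.3, 0.7]
score = pearson_cc(ground_truth, decoded)
```

A score closer to 1 indicates that the decoded features track the ground truth more closely.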
