Naturalistic Music Decoding from EEG Data via Latent Diffusion Models

Emilian Postolache, Natalia Polouliakh, Hiroaki Kitano, Akima Connelly, Emanuele Rodolà, Luca Cosmo, Taketo Akama

TL;DR

This work tackles the reconstruction of naturalistic music from non-invasive EEG signals using latent diffusion models conditioned through ControlNet adapters. The proposed EEG-conditioned pipeline builds on AudioLDM2, applying a lightweight projector to align EEG data with the diffusion model's latent space while keeping pre-processing minimal. Evaluation relies on neural-embedding-based metrics (CLAP and EnCodec) and Fréchet distances to capture semantic audio attributes despite low EEG temporal resolution; the authors report improvements over a convolutional baseline and promising performance on held-out tracks. The study demonstrates the feasibility of non-invasive brain-to-audio reconstruction for complex musical stimuli and points to the need for larger datasets and further methodological refinements to improve distributional generalization and real-time applicability.

Abstract

In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into high-quality general music reconstruction from non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing or channel selection. We train our models on the public NMED-T dataset and perform a quantitative evaluation, proposing neural embedding-based metrics. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Illustration of the proposed method. We use ControlNet to condition a diffusion model on EEG data in order to decode high-quality naturalistic music.
  • Figure 2: Qualitative results of our method. On the left, ground-truth musical chunks. In the middle, reconstructions obtained via a baseline ConvNet. On the right, decodings obtained by our method. Notice how our method better matches the real tracks.
  • Figure 3: Cross-CLAP scores between decoded and ground truth tracks. Left: Our method (ControlNet-2). Right: Baseline convolutional network. Notice how the matrices are closer to diagonal with our method, indicating higher correlation (as measured by CLAP score) between decoded and ground truth tracks.
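
The cross-score matrix described in the Figure 3 caption can be illustrated with a small sketch. It assumes that each score is a cosine similarity between an embedding of a decoded track and an embedding of a ground truth track, which this page does not spell out. The vectors below are toy stand-ins rather than real CLAP embeddings, and the function names (`cosine_similarity`, `cross_similarity_matrix`, `diagonal_hit_rate`) are hypothetical, not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_similarity_matrix(
    decoded: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Entry [i][j] compares decoded track i with ground truth track j."""
    return [[cosine_similarity(d, r) for r in reference] for d in decoded]


def diagonal_hit_rate(matrix: Sequence[Sequence[float]]) -> float:
    """Fraction of rows whose largest score lies on the diagonal."""
    hits = 0
    for i, row in enumerate(matrix):
        best_column = max(range(len(row)), key=lambda j: row[j])
        if best_column == i:
            hits += 1
    return hits / len(matrix)


# Toy embeddings (assumption): three ground truth tracks and three decodings.
reference_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoded_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.1, 0.0, 0.7]]

scores = cross_similarity_matrix(decoded_embeddings, reference_embeddings)
for row in scores:
    print(" ".join(f"{value:.2f}" for value in row))
print("diagonal hit rate:", diagonal_hit_rate(scores))
```

A matrix that is close to diagonal means each decoding is most similar to its own ground truth track. The hit rate is just one simple way to summarize that property and is not a metric reported in the paper.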