DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG

Karan Thakkar, Mounya Elhilali

TL;DR

This work addresses the limited fidelity of EEG-based speech-envelope reconstruction when it is treated as a static mapping. It introduces DECAF, a dynamic state-space fusion model that combines a neural envelope estimate from EEG with an autoregressive temporal prior derived from the model's own past predictions. The architecture is gated and fully causal, which enables online use. It unifies three components: an EEG-to-envelope decoder, an Envelope Forecaster, and a Dynamic Fusion Gate. The final envelope is computed as $A_n = \alpha \hat{A}_{eeg} + (1 - \alpha) \hat{A}_{prior}$, and the model is optimized with a hybrid loss $\mathcal{L} = \lambda_1 \mathcal{L}_{\text{L1}}(A_n, A_{\text{true}}) - \lambda_2 \rho(A_n, A_{\text{true}})$, where $\lambda_1 = 1$ and $\lambda_2 = 0.2$. On the ICASSP 2023 Task 2 benchmark, DECAF achieves state-of-the-art performance ($M = 0.170 \pm 0.061$), surpassing previous methods, and ablation results show the synergistic value of combining neural evidence with temporal context. The work demonstrates that framing envelope reconstruction as a dynamic state-estimation problem yields higher-fidelity envelopes, supporting more accurate and coherent neural decoding for neuro-steered hearing applications and online BCI systems.
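The fusion rule and hybrid loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, $\rho$ is assumed to be the Pearson correlation, and $\mathcal{L}_{\text{L1}}$ is assumed to be the mean absolute error over samples.

```python
import numpy as np


def fuse(a_eeg, a_prior, alpha):
    """Convex combination A_n = alpha * A_eeg + (1 - alpha) * A_prior.

    alpha may be a scalar or an array broadcastable to the envelope shape.
    """
    return alpha * a_eeg + (1.0 - alpha) * a_prior


def pearson(x, y, eps=1e-8):
    """Pearson correlation between two 1-D signals (assumed meaning of rho)."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum()) + eps  # eps avoids 0/0
    return float((xc * yc).sum() / denom)


def hybrid_loss(a_pred, a_true, lam1=1.0, lam2=0.2):
    """L = lam1 * L1(a_pred, a_true) - lam2 * rho(a_pred, a_true).

    L1 is taken here as mean absolute error (an assumption).
    """
    l1 = float(np.abs(a_pred - a_true).mean())
    return lam1 * l1 - lam2 * pearson(a_pred, a_true)


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 100)
    a_true = np.sin(2 * np.pi * 3 * t)             # toy "ground-truth" envelope
    a_eeg = a_true + 0.5 * np.cos(2 * np.pi * t)   # toy noisy EEG estimate
    a_prior = 0.8 * a_true                         # toy temporal prior
    a_n = fuse(a_eeg, a_prior, alpha=0.3)
    print(hybrid_loss(a_n, a_true))
```

Because the correlation term enters with a negative sign, minimizing this loss rewards both small sample-wise error and a well-matched envelope shape; a perfect reconstruction gives a loss of approximately $-\lambda_2$.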

Abstract

Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.

Paper Structure

This paper contains 7 sections, 5 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Illustration of the shift from static to dynamic decoding. (Top) Static baselines perform stateless reconstruction using only an isolated 'Present Window' of EEG. (Bottom) Our dynamic model, DECAF, is state-aware, creating a temporal prior from past context and fusing it with present EEG information. The dashed loop indicates the model's fully recursive operation, using its own past predictions.
  • Figure 2: The system generates the current envelope prediction ($A_n$) by fusing a direct neural estimate from the EEG ($E_n$) with a temporal prediction derived from its own past output ($A_{n-1}$).
  • Figure 3: The baseline models (left three panels) effectively capture low-frequency energy but fail to reconstruct higher-frequency details compared to the ground truth (black). The rightmost panel decomposes our proposed model, DECAF. The final Fusion output (blue) synergistically combines the low-frequency accuracy of the EEG branch (red) with the high-frequency information from the Envelope Forecaster (orange), allowing it to track the ground-truth spectrum with significantly higher fidelity.
  • Figure 4: Reconstruction performance across varying EEG noise levels (SNR).
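
The recursive operation described in Figures 1 and 2 can be sketched as a causal loop in which each prediction $A_n$ becomes the context for the next step. This is a sketch under stated assumptions rather than the paper's architecture: `decode_recursively` and its three callables (`eeg_decoder`, `forecaster`, `gate`) are hypothetical stand-ins for the EEG-to-envelope decoder, the Envelope Forecaster, and the Dynamic Fusion Gate, and the toy lambdas at the bottom exist only to make the example runnable.

```python
import numpy as np


def decode_recursively(eeg_windows, eeg_decoder, forecaster, gate, a_init):
    """Causal, state-aware decoding loop (illustrative only).

    eeg_windows : iterable of EEG windows E_n
    eeg_decoder : callable E_n -> direct neural envelope estimate
    forecaster  : callable A_{n-1} -> temporal prior for the current window
    gate        : callable (a_eeg, a_prior) -> alpha in [0, 1]
    a_init      : initial state standing in for A_0
    """
    a_prev = np.asarray(a_init, dtype=float)
    outputs = []
    for e_n in eeg_windows:
        a_eeg = eeg_decoder(e_n)          # evidence from the present window
        a_prior = forecaster(a_prev)      # prediction from the model's own past output
        alpha = gate(a_eeg, a_prior)      # weight given to the neural evidence
        a_n = alpha * a_eeg + (1.0 - alpha) * a_prior
        outputs.append(a_n)
        a_prev = a_n                      # feed the prediction back (dashed loop in Fig. 1)
    return np.stack(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    windows = rng.standard_normal((5, 4, 16))  # 5 windows, 4 channels, 16 samples
    env = decode_recursively(
        windows,
        eeg_decoder=lambda e: e.mean(axis=0),  # toy decoder: average over channels
        forecaster=lambda a: a,                # toy prior: persistence of last output
        gate=lambda a, b: 0.5,                 # toy gate: fixed equal weighting
        a_init=np.zeros(16),
    )
    print(env.shape)
```

The loop only ever reads the current EEG window and the previous output, which is what makes this style of decoding usable online; a static baseline corresponds to a gate that always returns 1.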