DECAF: Dynamic Envelope Context-Aware Fusion for Speech-Envelope Reconstruction from EEG
Karan Thakkar, Mounya Elhilali
TL;DR
This work addresses the fidelity limits of EEG-based speech envelope reconstruction when it is treated as a static mapping. It introduces DECAF, a dynamic state-space fusion model that combines a neural envelope estimate from EEG with an autoregressive temporal prior derived from past predictions. The architecture is gated and fully causal, which enables online use. DECAF unifies an EEG-to-envelope decoder, an Envelope Forecaster, and a Dynamic Fusion Gate. The final envelope is $A_n = \alpha \hat{A}_{eeg} + (1 - \alpha) \hat{A}_{prior}$, and the model is optimized with a hybrid loss $L = \lambda_1 \mathcal{L}_{\text{L1}}(A_n, A_{\text{true}}) - \lambda_2 \rho(A_n, A_{\text{true}})$, where $\lambda_1 = 1$ and $\lambda_2 = 0.2$. On the ICASSP 2023 Task 2 benchmark, DECAF achieves state-of-the-art performance ($M = 0.170 \pm 0.061$), surpassing previous methods, and ablation results show the synergistic value of combining neural evidence and temporal context. The work demonstrates that framing envelope reconstruction as a dynamic state-estimation problem yields higher-fidelity envelopes, supporting more accurate and coherent neural decoding for neuro-steered hearing applications and online BCI systems.
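The fusion rule and hybrid loss stated above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $\mathcal{L}_{\text{L1}}$ is a mean absolute error, that $\rho$ is the Pearson correlation, and that $\alpha$ is a scalar or per-sample array in $[0, 1]$. The function names `fuse`, `pearson`, and `hybrid_loss` are hypothetical.

```python
import numpy as np

def fuse(a_eeg, a_prior, alpha):
    """Convex combination A_n = alpha * A_eeg + (1 - alpha) * A_prior.

    alpha may be a scalar or an array broadcastable to the envelope shape.
    """
    return alpha * a_eeg + (1.0 - alpha) * a_prior

def pearson(x, y, eps=1e-8):
    """Pearson correlation between two 1-D envelopes (assumed form of rho)."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum()) + eps
    return float((xc * yc).sum() / denom)

def hybrid_loss(a_n, a_true, lam1=1.0, lam2=0.2):
    """L = lam1 * L1(A_n, A_true) - lam2 * rho(A_n, A_true).

    The L1 term is taken as a mean absolute error (an assumption);
    lam1 = 1 and lam2 = 0.2 follow the values stated in the text.
    """
    l1 = float(np.mean(np.abs(a_n - a_true)))
    return lam1 * l1 - lam2 * pearson(a_n, a_true)
```

Because the correlation term enters with a negative sign, a perfect reconstruction drives the loss toward $-\lambda_2$ rather than zero: the L1 term vanishes and $\rho$ approaches 1.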
Abstract
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Prevailing approaches treat it as a static regression problem, processing each EEG window in isolation and ignoring the rich temporal structure inherent in continuous speech. This study introduces a new, dynamic framework for envelope reconstruction that leverages this structure as a predictive temporal prior. We propose a state-space fusion model that combines direct neural estimates from EEG with predictions from recent speech context, using a learned gating mechanism to adaptively balance these cues. To validate this approach, we evaluate our model on the ICASSP 2023 Stimulus Reconstruction benchmark, demonstrating significant improvements over static, EEG-only baselines. Our analyses reveal a powerful synergy between the neural and temporal information streams. Ultimately, this work reframes envelope reconstruction not as a simple mapping but as a dynamic state-estimation problem, opening a new direction for developing more accurate and coherent neural decoding systems.
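The causal recursion described above, in which each output depends only on the current EEG estimate and on past fused outputs, can be sketched as follows. This is an illustration under stated assumptions, not the model itself: `mean_forecaster` and `constant_gate` are hypothetical stand-ins for the learned forecaster and gate, and the `context` length and the cold-start rule (use the EEG estimate alone when no history exists) are assumptions.

```python
import numpy as np

def causal_fusion(a_eeg, forecaster, gate, context=4):
    """Fuse EEG estimates with a prior forecast from past fused outputs.

    At step t, only a_eeg[t] and fused[:t] are read, so the loop is causal.
    """
    n = len(a_eeg)
    fused = np.zeros(n)
    for t in range(n):
        past = fused[max(0, t - context):t]
        if past.size == 0:
            # Cold start (assumption): no history yet, trust the EEG estimate.
            fused[t] = a_eeg[t]
            continue
        a_prior = forecaster(past)            # temporal prior from recent context
        alpha = gate(a_eeg[t], a_prior)       # weight in [0, 1] on the EEG cue
        fused[t] = alpha * a_eeg[t] + (1.0 - alpha) * a_prior
    return fused

def mean_forecaster(past):
    """Hypothetical stand-in: predict the mean of the recent fused outputs."""
    return float(np.mean(past))

def constant_gate(a_eeg_t, a_prior_t):
    """Hypothetical stand-in: weight both cues equally."""
    return 0.5
```

In a learned system the gate would presumably depend on its inputs, shifting weight toward the prior when the EEG estimate is unreliable. The constant gate here only demonstrates the data flow.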
