
A Probabilistic Generative Model for Spectral Speech Enhancement

Marco Hidalgo-Araya, Raphaël Trésor, Bart van Erp, Wouter W. L. Nuijten, Thijs van de Laar, Bert de Vries

Abstract

Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduces a unified modular framework that formulates signal processing, learning, and personalization as Bayesian inference with explicit uncertainty tracking. The proposed framework replaces ad hoc algorithm design with a single probabilistic generative model that continuously adapts to changing acoustic conditions and user preferences. It extends spectral subtraction with principled mechanisms for in-situ personalization and adaptation to acoustic context. The system is implemented as an interconnected probabilistic state-space model, and inference is performed via variational message passing in the RxInfer.jl probabilistic programming environment, enabling real-time Bayesian processing under hearing-aid constraints. Proof-of-concept experiments on the VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with 85 effective parameters. The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing and points toward devices that learn continuously through probabilistic inference.
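For orientation, the classical spectral-subtraction baseline that the paper's generative model extends can be sketched as follows. This is a minimal illustration of textbook magnitude spectral subtraction, not the paper's Bayesian model; the function name, frame length, and flooring constant are assumptions chosen for the example.

```python
import numpy as np

def spectral_subtraction(noisy, frame_len=256, noise_frames=5, floor=0.01):
    """Textbook magnitude spectral subtraction (illustrative baseline only):
    estimate a noise magnitude spectrum from the first few frames, subtract it
    from each frame's magnitude with a spectral floor, and resynthesize by
    overlap-add. Assumes the recording starts with noise-only frames."""
    hop = frame_len // 2
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)           # stationary noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # subtract, keep a floor
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):                             # overlap-add resynthesis
        out[i * hop:i * hop + frame_len] += clean[i]
    return out
```

The fixed `noise_frames` and `floor` parameters are exactly the kind of manually tuned constants that the paper replaces with latent variables inferred online.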

Paper Structure

This paper contains 55 sections, 81 equations, 5 figures, 4 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Proposed decomposition of Eq. \ref{eq:py|xr}; see Section \ref{sec:modularization} for a detailed discussion.
  • Figure 2: Forney-style factor graph of the AIDA-2 framework defined in Eq. \ref{eq:AIDA-full-architecture}. The figure illustrates the conditional-dependency structure among the four modules: the Warped-frequency Filter Bank (WFB), the Bayesian Speech Enhancement Model (SEM), the Acoustic Context Model (ACM), and the End-User Model (EUM). Arrows indicate conditional dependencies and information flow within the generative framework (see Section \ref{sec:modularization}).
  • Figure 3: Illustration of message passing on a Forney-style factor graph. Each node $f_a$ represents a factor, and each edge $s_j$ a variable shared between factors. Forward messages $\vec{\mu}_{j}(s_j)$ and backward messages $\reflectbox{\vec{\reflectbox{\mu}}}_{j}(s_j)$ propagate in opposite directions along edges, summarizing information from their respective subgraphs. The posterior marginal on edge $s_j$ is obtained as the normalized product of the two colliding messages, as in Eq. \ref{eq:tree_marginal}.
  • Figure 4: Forney-style factor graph representation of the SEM, shown without the analysis module, which transforms $z_m$ into the log-spectral coefficients $\tilde{z}_m$, and the synthesis module, which maps the spectral weights $\tilde{w}_m$ back to the time-domain filter coefficients $w_m$. Each node corresponds to a probabilistic factor, and edges denote shared latent variables.
  • Figure 5: Block diagram of the Warped-Frequency Filter Bank (WFB) architecture. The input signal $x_k$ passes through a cascade of first-order all-pass filters $A(q^{-1})$, producing warped delay-line signals $z_{kj}$ with internal states $v_{kj}$. A time-domain FIR structure with weights $w_m$ generates the output $y_k$. In parallel, the warped signals $z_{kj}$ are provided to the Spectral Enhancement Model (SEM), which infers and synthesizes the time-domain coefficients $w_m$, enabling perceptually aligned, low-latency enhancement.
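The WFB structure described in Figure 5 can be sketched in a few lines: the input feeds a cascade of first-order all-pass sections, and the tap signals are combined by FIR weights. This is an assumed implementation of the standard warped-FIR idea, not the paper's code; the function name, warping coefficient `lam`, and state layout are illustrative. With `lam = 0` each all-pass section reduces to a unit delay and the structure collapses to an ordinary FIR filter.

```python
import numpy as np

def warped_fir(x, w, lam=0.5):
    """Warped FIR filter bank sketch (assumed form, mirroring Figure 5):
    the input x_k passes through a cascade of first-order all-pass sections
    A(z) = (z^{-1} - lam) / (1 - lam z^{-1}); the tap outputs z_{kj} are
    weighted by w_m to form y_k. lam warps the frequency axis toward a
    perceptually motivated (e.g. Bark-like) scale; lam = 0 gives a plain FIR."""
    M = len(w)
    v = np.zeros(M)                # one internal state per all-pass section
    y = np.zeros(len(x))
    for k, xk in enumerate(x):
        taps = np.empty(M)
        s = xk                     # tap 0 is the input itself
        taps[0] = s
        for j in range(1, M):      # propagate through the all-pass cascade
            out = -lam * s + v[j]  # transposed direct-form II section
            v[j] = s + lam * out
            s = out
            taps[j] = s
        y[k] = w @ taps            # FIR combination of warped delay-line taps
    return y
```

In the paper's architecture these FIR weights $w_m$ are not fixed: the SEM infers spectral weights and synthesizes the corresponding time-domain coefficients on the fly.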

Theorems & Definitions (1)

  • Definition B.1: Bayesian Leaky Integrator