Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping

Clémentine Berger, Roland Badeau, Slim Essid

TL;DR

This work addresses the masking of ambient noise with music by learning a spectral envelope shaping guided by a psychoacoustic masking model. The authors introduce DPNMM, a U-Net-based network that predicts per-band gains g(n,ν), which are used to construct filter frequency responses. Applied to the music, these filters raise its masking thresholds while preserving the original mix and the user's chosen listening level; training relies on a constrained, multiplier-based perceptual loss. Evaluations on simulated headphone listening scenes show that DPNMM, especially when trained without a power constraint, improves masking (NMR) while maintaining fidelity (GLD) compared to a state-of-the-art perceptual equalizer, and that the constrained variants offer better power preservation and broader applicability. The approach advances practical, perceptually informed audio rendering for noisy listening environments and can be extended to other listening contexts and user preferences.

Abstract

People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise's frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music's ability to mask ambient noise by reshaping its spectral envelope with predicted filter frequency responses. The model is trained with a perceptual loss function that balances two constraints: effectively masking the noise while preserving the original music mix and the user's chosen listening level. We evaluate our approach on simulated data replicating a user's experience of listening to music with headphones in a noisy environment. The results, based on defined objective metrics, demonstrate that our system improves on the state of the art.

Paper Structure

This paper contains 10 sections, 8 equations, 4 figures.

Figures (4)

  • Figure 1: Overview of the proposed system. Bark features are computed from the music and noise signals: the PSD per Bark band for both music and noise, and the music masking thresholds. These features are fed to the U-Net, which outputs gains in dB used to scale filter frequency responses. The filters are applied to the music in the spectral domain, and the processed version is generated by inverse STFT.
  • Figure 2: U-Net architecture of the proposed model. $N$ is the number of time frames, and 26 the number of Bark bands. The encoder is composed of 4 convolutional layers (Conv), a linear layer and a Gated Recurrent Unit (GRU) layer. The decoder follows the inverse path with transposed convolutional layers (TConv) and pathway convolutions (PConv) as add-skip connections.
  • Figure 3: Pattern $W_{dB}^\nu(f)$ used to shape the filters' frequency responses, for $\nu = 5$.
  • Figure 4: Obtained NMR and GLD on the test set for the Estreder model and four versions of the proposed neural model with different degrees of power constraint during training: no constraint and $\Delta \mathcal{P}_{max} = 2, 1, 0.5$ dBA.
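
The filtering step described in Figures 1 and 3 can be illustrated with a minimal Python sketch, which is not the authors' implementation. It assumes that the filter response in dB is a sum of per-band patterns, each scaled by its predicted gain. The triangular pattern, the Traunmüller Bark approximation, and all function names are hypothetical choices made for illustration; only the 26 Bark bands and the use of gains in dB come from the captions above.

```python
import math

N_BANDS = 26  # number of Bark bands, as stated in the Figure 2 caption


def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark scale (assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def band_pattern(bark, band):
    """Hypothetical triangular pattern: 1 at the band centre, 0 one Bark away."""
    return max(0.0, 1.0 - abs(bark - (band + 0.5)))


def filter_response_db(gains_db, freqs_hz):
    """Response in dB per frequency bin: sum of patterns scaled by band gains."""
    response = []
    for f in freqs_hz:
        z = hz_to_bark(f)
        response.append(
            sum(g * band_pattern(z, nu) for nu, g in enumerate(gains_db))
        )
    return response


def apply_filter(frame, response_db):
    """Scale one frame of complex STFT bins (dB converted to linear amplitude)."""
    return [x * 10.0 ** (h / 20.0) for x, h in zip(frame, response_db)]


# Usage: boost band 8 (around 1 kHz) by 6 dB for a single frame.
gains = [0.0] * N_BANDS
gains[8] = 6.0
freqs = [100.0, 1000.0]
frame = [1 + 0j, 1 + 0j]
processed = apply_filter(frame, filter_response_db(gains, freqs))
```

In this sketch the bin at 100 Hz is left unchanged, while the bin at 1000 Hz is amplified by slightly less than 6 dB because it sits just off the assumed band centre. In the actual system the gains vary per time frame n, and the processed signal is obtained by inverse STFT.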