BNMusic: Blending Environmental Noises into Personalized Music

Chi Zuo, Martin B. Møller, Pablo Martínez-Nuevo, Huayang Huang, Yu Wu, Ye Zhu

TL;DR

BNMusic addresses the challenge of reducing perceptual annoyance from environmental noise by blending the noise into personalized music rather than canceling it. It introduces a two-stage diffusion-based pipeline: Stage 1 performs noise-aligned music synthesis via outpainting and inpainting around high-energy noise regions, and Stage 2 applies adaptive amplification guided by a masking-threshold objective to maximize perceptual masking with controlled loudness. The approach leverages mel-spectrogram representations, STFT-based masking thresholds, and Griffin-Lim reconstruction to produce audibly coherent music that harmonizes with ambient noise. Experiments on EPIC-SOUNDS, ESC-50, and MusicBench show BNMusic achieving superior objective (FAD, KL) and subjective (OVL, PER) metrics, demonstrating effective noise blending with rhythmically aligned, adaptively amplified music. The work offers a scalable perceptual enhancement for shared environments like public transport, while acknowledging limitations in prompt-noise compatibility and real-time applicability.
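The Stage 2 idea of trading masking against loudness can be illustrated with a toy sketch. It is not the paper's objective: the per-band levels, the fixed `offset_db` between music level and masking threshold, the `loudness_penalty` weight, and the grid of candidate gains are all hypothetical simplifications chosen for illustration.

```python
import numpy as np

def masked_fraction(music_db, noise_db, gain_db, offset_db=10.0):
    """Fraction of bands whose noise level lies below a toy masking threshold.

    Assumption: the threshold in each band is the amplified music level
    minus a fixed offset. Real masking models are frequency dependent.
    """
    threshold_db = music_db + gain_db - offset_db
    return float(np.mean(noise_db < threshold_db))

def pick_gain(music_db, noise_db, gains_db, offset_db=10.0, loudness_penalty=0.02):
    """Grid-search the gain that maximizes masking minus a loudness penalty."""
    scores = [
        masked_fraction(music_db, noise_db, g, offset_db) - loudness_penalty * g
        for g in gains_db
    ]
    return float(gains_db[int(np.argmax(scores))])

# Hypothetical per-band levels in dB for four frequency bands.
music_db = np.array([60.0, 55.0, 50.0, 45.0])
noise_db = np.array([52.0, 50.0, 38.0, 30.0])
gains_db = np.arange(0.0, 12.5, 0.5)

best_gain = pick_gain(music_db, noise_db, gains_db)
print(best_gain, masked_fraction(music_db, noise_db, best_gain))
```

In this toy setting the search settles on the smallest gain that masks every band, because any further amplification only adds to the penalty term.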

Abstract

When people are disturbed by environmental noises, acoustic masking is a conventional audio-engineering technique for reducing the annoyance: it covers up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, we introduce an alternative to acoustic masking that aims to reduce the noticeability of environmental noises by blending them into personalized music generated from user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting its ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise and thereby improving the overall acoustic experience. Project page: https://d-fas.github.io/BNMusic_page/.

Paper Structure

This paper contains 21 sections, 5 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a): Noise cancelling is designed for individual use, requiring proximity to the user, while our noise blending aims to reduce noise annoyance for everyone in the room by seamlessly blending a complementary sound with the surrounding noise. (b): This figure demonstrates the principle of auditory masking in psychoacoustics (Painter2000). The yellow dashed line indicates the threshold in quiet, i.e., the baseline level below which sounds are inaudible in the absence of other stimuli. When a music signal is introduced, it elevates the masking threshold in a frequency-dependent manner, shown by the yellow solid line. As a result, any concurrent noise components that fall below this elevated threshold become imperceptible to the listener, effectively masked by the presence of the music.
  • Figure 2: Overall pipeline of our proposed BNMusic framework to achieve noise blending with frozen music generators. The two stages of our approach are marked with different background colors. In Stage 1, our approach generates music that aligns with the noise, and in Stage 2 we adaptively amplify the music signal to reach the most ideal and reasonable blending with the noise.
  • Figure 3: Visualization of noise-music blending effectiveness across methods. The left images display the mel-spectrograms of three types of noise, while the right heatmaps show the differences between the generated music and the noise. The heatmaps illustrate how four music samples blend with the respective noise. Red represents positive values, blue indicates negative values and darker colors correspond to larger magnitudes, highlighting the blending effectiveness of each music type.
  • Figure 4: The lowest FAD values for each type of music are highlighted. They all occur at -15 dB LUFS, indicating that, for all types of music in our experiment, the optimal loudness level consistently falls between -18 and -12 dB LUFS.
  • Figure 5: Illustration of how conventional auditory masking, often using overly low volume and mismatched rhythm, can disrupt the listening experience. Proper volume adjustment and rhythmic alignment are essential for achieving a more harmonious and pleasant blend with background noise.
  • ...and 3 more figures
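
The masking principle described in the Figure 1(b) caption can be sketched numerically. The snippet below is a minimal illustration, not the paper's threshold computation: it uses the widely cited closed-form approximation of the threshold in quiet attributed to Terhardt, and the elevated `music_threshold_db` values and the flat 30 dB noise levels are hypothetical numbers picked for the example.

```python
import numpy as np

def threshold_in_quiet_db(freq_hz):
    """Approximate absolute hearing threshold in dB SPL (Terhardt's formula)."""
    khz = np.asarray(freq_hz, dtype=float) / 1000.0
    return (
        3.64 * khz ** -0.8
        - 6.5 * np.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )

def audible(noise_db, freq_hz, masker_threshold_db=None):
    """Return a boolean mask of noise components above the effective threshold.

    Without a masker, the effective threshold is the threshold in quiet.
    With a masker, it is the larger of the two at each frequency.
    """
    quiet = threshold_in_quiet_db(freq_hz)
    if masker_threshold_db is None:
        effective = quiet
    else:
        effective = np.maximum(quiet, np.asarray(masker_threshold_db, dtype=float))
    return np.asarray(noise_db, dtype=float) > effective

freqs_hz = np.array([250.0, 1000.0, 4000.0])
noise_db = np.array([30.0, 30.0, 30.0])
# Hypothetical threshold elevated by a music signal (dB SPL per frequency).
music_threshold_db = np.array([40.0, 25.0, 35.0])

audible_alone = audible(noise_db, freqs_hz)
audible_with_music = audible(noise_db, freqs_hz, music_threshold_db)
print(audible_alone, audible_with_music)
```

With these numbers, all three noise components exceed the threshold in quiet, while only the 1000 Hz component remains above the elevated threshold once the hypothetical music masker is present.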