BNMusic: Blending Environmental Noises into Personalized Music
Chi Zuo, Martin B. Møller, Pablo Martínez-Nuevo, Huayang Huang, Yu Wu, Ye Zhu
TL;DR
BNMusic addresses the challenge of reducing perceptual annoyance from environmental noise by blending the noise into personalized music rather than canceling it. It introduces a two-stage diffusion-based pipeline: Stage 1 performs noise-aligned music synthesis via outpainting and inpainting around high-energy noise regions, and Stage 2 applies adaptive amplification guided by a masking-threshold objective to maximize perceptual masking with controlled loudness. The approach leverages mel-spectrogram representations, STFT-based masking thresholds, and Griffin-Lim reconstruction to produce audibly coherent music that harmonizes with ambient noise. Experiments on EPIC-SOUNDS, ESC-50, and MusicBench show BNMusic achieving superior objective (FAD, KL) and subjective (OVL, PER) metrics, demonstrating effective noise blending with rhythmically aligned, adaptively amplified music. The work offers a scalable perceptual enhancement for shared environments like public transport, while acknowledging limitations in prompt-noise compatibility and real-time applicability.
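To make the Stage 1 idea of working "around high-energy noise regions" concrete, the sketch below shows one plausible way to locate such regions in a noise mel-spectrogram. It is an illustration under stated assumptions, not the authors' procedure: the function names `high_energy_mask` and `mask_to_regions`, the percentile rule, and the default of 75 are all hypothetical, and the summary does not specify how BNMusic selects these regions.

```python
import numpy as np

def high_energy_mask(mel_db, percentile=75.0):
    """Mark frames whose mean level lies at or above a percentile threshold.

    mel_db: array of shape (n_mels, n_frames) holding levels in dB.
    percentile: hypothetical threshold choice, not a value from the paper.
    """
    frame_level = mel_db.mean(axis=0)                 # mean level per time frame
    threshold = np.percentile(frame_level, percentile)
    return frame_level >= threshold

def mask_to_regions(mask):
    """Convert a boolean frame mask into (start, end) index pairs, end exclusive."""
    padded = np.concatenate(([False], mask, [False]))
    # Indices where the mask flips: even positions are starts, odd are ends.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]

# Toy noise spectrogram: quiet background with two loud bursts.
noise_mel_db = np.full((4, 10), -60.0)
noise_mel_db[:, 2:4] = -10.0
noise_mel_db[:, 7] = -5.0

regions = mask_to_regions(high_energy_mask(noise_mel_db))
```

In a pipeline of this kind, the returned frame ranges could serve as the anchors that a diffusion model keeps fixed while it fills in the remaining frames; that wiring is not shown here.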
Abstract
When environmental noise is disturbing, acoustic masking is a conventional audio-engineering technique for reducing the annoyance: it covers the noise with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, we introduce an alternative to acoustic masking that aims to reduce the noticeability of environmental noise by blending it into personalized music generated from user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose the Blending Noises into Personalized Music (BNMusic) framework, which has two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. The second stage adaptively amplifies the generated music segment to further reduce noise perception and enhance blending effectiveness while preserving auditory quality. Comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework: it blends environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise and thereby improving the overall acoustic experience. Project page: https://d-fas.github.io/BNMusic_page/.
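The trade-off behind the second stage, masking more of the noise without raising the volume excessively, can be illustrated with a toy gain search. This is a minimal sketch under a deliberately simplified rule, not the masking-threshold objective used by BNMusic: it assumes a time-frequency bin of noise counts as masked when the amplified music level exceeds it by a fixed margin. The function name `smallest_masking_gain` and the parameters `offset_db`, `target`, `max_gain_db`, and `step_db` are hypothetical.

```python
import numpy as np

def smallest_masking_gain(music_db, noise_db, offset_db=10.0,
                          target=0.9, max_gain_db=12.0, step_db=0.5):
    """Find the smallest gain (dB) that masks a target fraction of noise bins.

    Simplified assumption: a bin is masked when
        noise_db <= music_db + gain - offset_db.
    max_gain_db caps loudness; all defaults are illustrative only.
    Returns (gain_db, masked_fraction).
    """
    n_steps = int(round(max_gain_db / step_db))
    gain, fraction = 0.0, 0.0
    for i in range(n_steps + 1):
        gain = i * step_db
        threshold_db = music_db + gain - offset_db
        fraction = float(np.mean(noise_db <= threshold_db))
        if fraction >= target:
            break                      # stop at the first gain that suffices
    return gain, fraction

# Toy levels in dB, shape (n_bands, n_frames).
music_db = np.array([[-20.0, -20.0], [-30.0, -30.0]])
noise_db = np.array([[-26.0, -26.0], [-36.0, -36.0]])

gain_db, masked = smallest_masking_gain(music_db, noise_db)
```

If no gain within the cap reaches the target, the function returns the capped gain together with the fraction it achieved, which mirrors the idea of keeping loudness controlled even when full masking is out of reach.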
