Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation
Adam Sorrenti
TL;DR
This work tackles singing voice separation from complex musical mixtures using a spectrogram-based pipeline that couples Short-Time Fourier Transform (STFT) representations with a U-Net segmentation model. It systematically compares normalization strategies and loss functions, finding that Min/Max scaling combined with Mean Absolute Error (MAE) loss delivers the strongest objective separation metrics on the MUSDB18 dataset: a Source-to-Distortion Ratio (SDR) of 7.1 dB, a Source-to-Interference Ratio (SIR) of 25.2 dB, and a Source-to-Artifact Ratio (SAR) of 7.2 dB. Data augmentation via splicing and blackout further improves model robustness, and reporting SDR, SIR, and SAR together gives a fuller view of separation quality than any single metric. The study offers practical guidance for designing spectrogram-based vocal separation systems and suggests avenues for further optimization, such as SDR-focused losses. A public GitHub repository is provided for reproducibility.
Abstract
Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study addresses the separation of vocal components from music spectrograms. We use the Short-Time Fourier Transform (STFT) to convert audio waveforms into time-frequency spectrograms, drawing on MUSDB18, a benchmark dataset for music source separation. We then train a U-Net neural network to segment the spectrogram image, aiming to delineate and extract the singing voice components accurately. Among our U-Net-based models, the combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR), 7.1 dB, indicating that the separated vocals stay close to the original signal. This configuration also recorded Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values substantially outperformed the other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: https://github.com/mbrotos/SoundSeg
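To make the best-performing configuration more concrete, the following is a minimal NumPy sketch of Min/Max scaling along the frequency axis and of the MAE loss. It is an illustration under stated assumptions, not the code from the repository: the function names (minmax_normalize, minmax_denormalize, mae), the (frequency bins, time frames) array layout, the eps constant, and the toy random data are all hypothetical, and "frequency-axis normalization" is read here as scaling each time frame by the minimum and maximum over its frequency bins.

```python
import numpy as np


def minmax_normalize(mag, axis=0, eps=1e-8):
    """Scale a magnitude spectrogram into [0, 1] with min/max taken along `axis`.

    Assumption: `mag` has shape (freq_bins, time_frames), so axis=0 is the
    frequency axis and every time frame gets its own min and max.
    """
    lo = mag.min(axis=axis, keepdims=True)
    hi = mag.max(axis=axis, keepdims=True)
    # eps avoids division by zero for frames that are constant (e.g. silence).
    return (mag - lo) / (hi - lo + eps), lo, hi


def minmax_denormalize(norm, lo, hi, eps=1e-8):
    """Invert minmax_normalize using the stored per-frame statistics."""
    return norm * (hi - lo + eps) + lo


def mae(pred, target):
    """Mean Absolute Error between two arrays of the same shape."""
    return float(np.mean(np.abs(pred - target)))


# Toy data standing in for STFT magnitudes (hypothetical, not MUSDB18).
rng = np.random.default_rng(0)
mix_mag = np.abs(rng.normal(size=(513, 128)))   # mixture magnitudes
vocal_mag = 0.5 * mix_mag                       # pretend vocal target

mix_norm, lo, hi = minmax_normalize(mix_mag)
# Scale the target with the mixture statistics so both share one range
# (one possible design choice, assumed here for illustration).
vocal_norm = (vocal_mag - lo) / (hi - lo + 1e-8)

loss = mae(mix_norm, vocal_norm)                # loss of a trivial "identity" prediction
restored = minmax_denormalize(mix_norm, lo, hi)
```

Storing lo and hi alongside the normalized spectrogram is what allows a network output in [0, 1] to be mapped back to magnitudes before the inverse STFT.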
