Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation

Adam Sorrenti

TL;DR

This work tackles singing voice separation from complex musical mixtures using a spectrogram-based pipeline that couples Short-Time Fourier Transform (STFT) representations with a U-Net segmentation model. It systematically compares normalization strategies and loss functions on the MUSDB18 dataset, finding that Min/Max scaling combined with Mean Absolute Error (MAE) loss gives the best objective separation metrics: an SDR of about 7.1 dB, with SIR and SAR of 25.2 dB and 7.2 dB, respectively. Data augmentation via splicing and blackout is used to improve model robustness, and evaluation with SDR, SIR, and SAR gives a comprehensive view of separation quality. The study offers practical guidance for designing spectrogram-based vocal separation systems and suggests avenues for further optimization, such as SDR-focused losses. A public GitHub repository supports reproducibility.
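To make the headline metric concrete, here is a minimal sketch of a simplified SDR: the ratio of reference energy to error energy, in decibels. This is an assumption-laden illustration, not the evaluation code used in the study. The full BSS Eval definitions of SDR, SIR, and SAR first decompose the error into interference and artifact components, which this sketch does not do. The function name `simple_sdr` and the `eps` guard are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-10):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    Assumption: this is the plain energy-ratio form, not the full
    BSS Eval decomposition that also yields SIR and SAR.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent or perfectly matched signals.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

# Toy example: an estimate that is the reference scaled by 0.9.
reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = 0.9 * reference
sdr_db = simple_sdr(reference, estimate)
```

In this toy case the error has 1% of the reference energy, so the ratio is 100 and the result is close to 20 dB. Higher values mean the estimate is closer to the reference.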

Abstract

Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study addresses the separation of vocal components from music spectrograms. We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed time-frequency spectrograms, using the benchmark MUSDB18 dataset for music separation. We then implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract the singing voice components accurately. Our U-Net-based models achieved noteworthy results in audio source separation. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: https://github.com/mbrotos/SoundSeg
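The best-performing configuration pairs Min/Max scaling with an MAE loss. The following is a minimal NumPy sketch of those two ingredients, not the project's implementation. It assumes a magnitude spectrogram laid out as (frequency, time), and it reads "frequency-axis normalization" as taking the minimum and maximum along the frequency axis for each time frame; the paper's exact convention may differ. All function names and the `eps` guard are hypothetical.

```python
import numpy as np

def minmax_normalize(spec, axis, eps=1e-8):
    """Scale a spectrogram to [0, 1] using the min/max taken along `axis`.

    Returns the scaled array plus the min and max, which are needed to
    undo the scaling before signal reconstruction.
    """
    lo = spec.min(axis=axis, keepdims=True)
    hi = spec.max(axis=axis, keepdims=True)
    return (spec - lo) / (hi - lo + eps), lo, hi

def minmax_denormalize(norm, lo, hi, eps=1e-8):
    """Invert minmax_normalize using the stored min and max."""
    return norm * (hi - lo + eps) + lo

def mae(pred, target):
    """Mean Absolute Error: grows linearly with the size of each error."""
    return float(np.mean(np.abs(pred - target)))

def mse(pred, target):
    """Mean Squared Error: grows quadratically, so outliers dominate."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
spec = rng.uniform(0.0, 10.0, size=(4, 6))  # toy (frequency, time) magnitudes

# Assumption: axis 0 is frequency, so each time frame is scaled separately.
norm, lo, hi = minmax_normalize(spec, axis=0)
restored = minmax_denormalize(norm, lo, hi)

# One large error among four values: MAE and MSE weigh it differently.
target = np.zeros(4)
pred = np.array([0.0, 0.0, 0.0, 4.0])
mae_value = mae(pred, target)
mse_value = mse(pred, target)
```

The last lines show the practical difference between the two losses: a single large error contributes linearly to MAE but quadratically to MSE, so MSE is driven more strongly by a few large deviations.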

Paper Structure (22 sections, 7 equations, 6 figures, 1 table, 2 algorithms)

Figures (6)

  • Figure 1: A system model depicting the process of complex-valued spectrogram estimation followed by U-Net (Ronneberger et al., 2015) segmentation and subsequent signal reconstruction through iSTFT.
  • Figure 2: Spectrogram examples illustrating the effect of splicing. The leftmost image shows the original mix, the middle image shows the original mix shifted by one time step, and the rightmost image presents the result of splicing the mix spectrograms. The red highlights show the component halves.
  • Figure 3: Spectrogram examples showing the blackout data augmentation. The top row displays the original mix and vocal spectrograms, while the bottom row demonstrates the effect of the blackout augmentation on both.
  • Figure 4: The original example integer spectrogram before normalization with Time on the x-axis and Frequency on the y-axis. The intensity values were randomly assigned.
  • Figure 5: Min/max normalization applied to the spectrogram. The left plot shows min/max normalization across the frequency axis, while the right plot applies min/max normalization across the time axis.
  • ...and 1 more figure
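The splicing and blackout augmentations described in the captions for Figures 2 and 3 can be sketched roughly as below. This is an illustration built on assumptions, not the project's code. It assumes spectrograms shaped (frequency, time), that the "component halves" are halves along the time axis, and that blackout zeroes a contiguous span of time frames. The captions do not specify these details, and the names `splice` and `blackout` and their parameters are hypothetical.

```python
import numpy as np

def splice(spec_a, spec_b):
    """Join the first time-half of spec_a with the second time-half of spec_b.

    Assumption: the halves are taken along the time axis (axis 1).
    """
    if spec_a.shape != spec_b.shape:
        raise ValueError("both spectrograms must have the same shape")
    mid = spec_a.shape[1] // 2
    return np.concatenate([spec_a[:, :mid], spec_b[:, mid:]], axis=1)

def blackout(mix, vocal, start, width):
    """Zero the same span of time frames in the mix and the vocal target.

    Assumption: the blacked-out region is a contiguous block of frames.
    The inputs are copied, so the original arrays are left unchanged.
    """
    mix_out, vocal_out = mix.copy(), vocal.copy()
    mix_out[:, start:start + width] = 0.0
    vocal_out[:, start:start + width] = 0.0
    return mix_out, vocal_out

# Toy (frequency, time) spectrograms standing in for two mix windows.
mix = np.arange(24.0).reshape(4, 6)
shifted = mix + 100.0            # stand-in for the time-shifted window
vocal = 0.5 * mix                # stand-in for the vocal target

spliced = splice(mix, shifted)
mix_black, vocal_black = blackout(mix, vocal, start=2, width=2)
```

Applying the same blackout region to both the input and the target keeps the training pair consistent, which matches the caption's note that the augmentation affects both spectrograms.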