Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation

Daniel Stoller, Sebastian Ewert, Simon Dixon

TL;DR

The paper tackles end-to-end audio source separation by introducing Wave-U-Net, a 1D multi-scale adaptation of U-Net that operates directly on time-domain waveforms to capture long-range temporal dependencies and phase information. It presents architectural innovations, including a difference output layer, proper input context with resampling, stereo support, and learned upsampling, to improve separation quality while mitigating artifacts. Experimental results on singing voice and multi-instrument tasks show competitive performance against state-of-the-art spectrogram-based approaches under comparable training conditions, and the work highlights evaluation challenges with SDR metrics, proposing rank-based summaries as a workaround. The study demonstrates the viability of time-domain separation and provides practical guidance for reducing border artifacts and upsampling-induced artifacts, with avenues for future improvements in losses and dataset scale.

Abstract

Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
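
The outlier problem mentioned at the end of the abstract can be illustrated with a small sketch. It assumes that "rank-based statistics" means the median and the median absolute deviation (MAD), which is one common choice; the function name `summarize_sdr` and the SDR values are hypothetical and are not taken from the paper's results.

```python
import numpy as np

def summarize_sdr(segment_sdrs):
    """Summarize segment-wise SDR values (dB) with moment-based and rank-based statistics."""
    x = np.asarray(segment_sdrs, dtype=float)
    # Assumption: undefined segments are encoded as NaN or inf and are skipped.
    x = x[np.isfinite(x)]
    median = np.median(x)
    # Median absolute deviation: a rank-based measure of spread.
    mad = np.median(np.abs(x - median))
    return {"mean": x.mean(), "std": x.std(), "median": median, "mad": mad}

# Hypothetical segment scores: five ordinary segments and one extreme outlier,
# such as a segment in which the target source is almost silent.
sdrs = [4.8, 5.1, 5.3, 4.9, 5.0, -40.0]
stats = summarize_sdr(sdrs)
print(stats)
```

With these made-up values, the single outlier pulls the mean below zero, while the median stays close to the typical segment score, which is the kind of distortion that rank-based summaries are meant to avoid.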

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our proposed Wave-U-Net with $K$ sources and $L$ layers. With our difference output layer, the $K$-th source prediction is the difference between the mixture and the sum of the other sources.
  • Figure 2: a) Common model (e.g. Jansson et al., 2017) with an even number of inputs (grey) which are zero-padded (black) before convolving, creating artifacts at the borders (dark colours). After decimation, a transposed convolution with stride 2 is shown here as upsampling by zero-padding intermediate and border values followed by normal convolution, which likely creates high-frequency artifacts in the output. b) Our model with proper input context and linear interpolation for upsampling (see the section on proper input context) does not use zero-padding. The number of features is kept odd, so that upsampling does not require extrapolating values (red arrow). Although the output is smaller, artifacts are avoided.
  • Figure 3: Violin plot of the segment-wise SDR values in the MUSDB test set for model M5. Black points show medians, dark blue lines the means.
  • Figure 4: Power spectrogram (dB) of a vocal estimate excerpt generated by a model without additional input context. Red markers show boundaries between independent segment-wise predictions.
  • Figure :
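
The upsampling described in the caption of Figure 2b can be sketched as follows. This is a minimal illustration of fixed linear interpolation only, not the authors' implementation, and it does not cover the learned upsampling variant mentioned in the TL;DR; the function name `upsample_linear` is hypothetical. Each new sample is placed between two existing neighbours, so n input samples become 2n - 1 output samples and nothing beyond the borders has to be extrapolated.

```python
import numpy as np

def upsample_linear(features):
    """Upsample along the last (time) axis from n to 2n - 1 samples.

    Existing samples are kept at the even output positions; each odd position
    holds the midpoint of its two neighbours, so no border value is extrapolated.
    """
    x = np.asarray(features, dtype=float)
    n = x.shape[-1]
    out = np.empty(x.shape[:-1] + (2 * n - 1,), dtype=float)
    out[..., 0::2] = x                                  # original samples
    out[..., 1::2] = 0.5 * (x[..., :-1] + x[..., 1:])   # interpolated midpoints
    return out

# One feature channel with an odd number (5) of time steps.
up = upsample_linear([[0.0, 2.0, 4.0, 2.0, 0.0]])
print(up.shape)  # (1, 9)
print(up[0])     # [0. 1. 2. 3. 4. 3. 2. 1. 0.]
```

Because 2n - 1 is odd for any n, the feature length remains odd after every upsampling step, which matches the caption's remark about the number of features.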