Music Source Separation Based on a Lightweight Deep Learning Framework (DTTNET: DUAL-PATH TFC-TDF UNET)

Junyu Chen, Susmitha Vekkot, Pancham Shukla

TL;DR

DTTNet addresses the need for lightweight, scalable music source separation by fusing a Dual-Path Module with a Time-Frequency Convolutions Time-Distributed Fully-connected UNet. It delivers competitive cSDR for vocals while using far fewer parameters than state-of-the-art baselines, aided by channel-wise head partitioning and BLSTM-based intra-/inter-band modeling. The work also investigates generalization to intricate audio patterns via a bespoke dataset, demonstrating gains when Vocal Chops-aware training is used and highlighting potential overfitting risks on smaller pattern sets. Overall, DTTNet provides a practical, efficient MSS approach with strong vocal-separation performance and promising generalization, with future work targeting improvements for drums and bass and the integration of zero-shot post-processing.

Abstract

Music source separation (MSS) aims to extract 'vocals', 'drums', 'bass' and 'other' tracks from a piece of mixed music. While deep learning methods have shown impressive results, there is a trend toward larger models. In this paper, we introduce a novel, lightweight architecture called DTTNet, which is based on a Dual-Path Module and a Time-Frequency Convolutions Time-Distributed Fully-connected UNet (TFC-TDF UNet). DTTNet achieves 10.12 dB cSDR on 'vocals', compared to 10.01 dB reported for Bandsplit RNN (BSRNN), with 86.7% fewer parameters. We also assess pattern-specific performance and model generalization for intricate audio patterns.
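
The cSDR figures above can be read with a small sketch in mind. The abstract does not define cSDR, so this assumes it denotes a chunk-level signal-to-distortion ratio: the median of per-chunk SDR values over fixed-length, non-overlapping chunks. The plain energy-ratio SDR used here is a simplifying assumption (evaluation toolkits may compute the per-chunk score differently), and the function names `sdr_db` and `chunk_sdr_db` are hypothetical.

```python
import math
from statistics import median


def sdr_db(reference, estimate, eps=1e-10):
    """Energy-ratio SDR in dB between a reference and an estimated signal."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against log(0) and division by zero for perfect estimates.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def chunk_sdr_db(reference, estimate, chunk_len):
    """Median SDR over consecutive non-overlapping chunks of chunk_len samples.

    Assumption: chunks whose reference is entirely silent are skipped,
    because their SDR is not meaningful.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    scores = []
    for start in range(0, len(reference) - chunk_len + 1, chunk_len):
        ref = reference[start:start + chunk_len]
        est = estimate[start:start + chunk_len]
        if any(ref):
            scores.append(sdr_db(ref, est))
    if not scores:
        raise ValueError("no non-silent chunks to score")
    return median(scores)


# Toy signal: the estimate is the reference scaled by 0.9, so the error
# energy is 1% of the signal energy in every chunk (about 20 dB).
toy_reference = [1.0, -1.0] * 4
toy_estimate = [0.9 * x for x in toy_reference]
toy_score = chunk_sdr_db(toy_reference, toy_estimate, chunk_len=2)
```

Because the score is a median over chunks, a few badly separated chunks move it less than they would move a mean, which is worth remembering when comparing figures as close as 10.12 dB and 10.01 dB.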

Paper Structure (16 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Framework of the Dual-Path TFC-TDF UNet for layer depth $D = 2$, where $L$ is the number of repeats of the Improved Dual-Path Module (IDPM); $C$ is the number of channels of the input spectrogram, with $g$ being the channel incremental factor; and $T$ and $F$ are the time and frequency axes that the 2D convolution operates on.
  • Figure 2: Sub-blocks of DTTNet, where $bf$ is the bottleneck factor of the Time-Distributed Fully-connected (TDF) layer; $B$ is the batch size; $F$ is the number of features on the frequency axis; $T$ is the number of features on the time axis; $C$ is the number of channels generated by the convolution layer; and $L$ is the number of repeats of the IDPM.
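
To illustrate what the bottleneck factor $bf$ in Figure 2 buys, here is a minimal sketch of the weight count of a TDF block. It assumes the block consists of two bias-free linear layers along the frequency axis, mapping $F \to F / bf \to F$. That layout is an assumption for illustration, not something the captions confirm, and the function names are hypothetical.

```python
def tdf_weight_count(num_freq_features, bottleneck_factor):
    """Weights in two bias-free linear layers F -> F // bf -> F (assumed layout)."""
    if bottleneck_factor < 1:
        raise ValueError("bottleneck_factor must be at least 1")
    hidden = num_freq_features // bottleneck_factor
    return num_freq_features * hidden + hidden * num_freq_features


def dense_weight_count(num_freq_features):
    """Weights in a single bias-free F -> F linear layer, for comparison."""
    return num_freq_features * num_freq_features


# Illustrative values only; F = 1024 and bf = 8 are not taken from the paper.
bottlenecked = tdf_weight_count(1024, 8)
dense = dense_weight_count(1024)
ratio = bottlenecked / dense
```

Under these assumptions the bottlenecked block needs roughly $2F^2 / bf$ weights against $F^2$ for a single full-width layer, so any $bf > 2$ reduces the count. The weights are shared across all $T$ time frames and $C$ channels, so the count does not depend on either.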