Towards Practical Real-Time Low-Latency Music Source Separation

Junyu Wu, Jie Liu, Tianrui Pan, Jie Tang, Gangshan Wu

TL;DR

The paper tackles real-time, low-latency music source separation by introducing RT-STT, a lightweight single-path architecture built upon DTTNet. It combines channel-expansion feature fusion with a temporal-focused single-path module, and it investigates quantization to accelerate inference. RT-STT achieves an SDR of $5.17$ dB on MUSDB18-HQ, surpassing HS-TasNet's $4.65$ dB with far fewer parameters (fewer than $10^{6}$) and, after quantization, an inference time of about $1.01$ ms on GPU. The work demonstrates practical deployment potential for hearing aids, live performances, and audio remixing by balancing performance, latency, and computational cost, and it outlines future improvements for higher-fidelity audio and broader hardware compatibility.

Abstract

In recent years, significant progress has been made in the field of deep learning for music demixing. However, there has been limited attention to real-time, low-latency music demixing, which holds potential for various applications, such as hearing aids, audio stream remixing, and live performances. Additionally, a notable tendency has emerged towards the development of larger models, limiting their applicability in certain scenarios. In this paper, we introduce a lightweight real-time low-latency model called Real-Time Single-Path TFC-TDF UNET (RT-STT), which is based on the Dual-Path TFC-TDF UNET (DTTNet). In RT-STT, we propose a feature fusion technique based on channel expansion. We also demonstrate the superiority of single-path modeling over dual-path modeling in real-time models. Moreover, we investigate the method of quantization to further reduce inference time. RT-STT exhibits superior performance with significantly fewer parameters and shorter inference times compared to state-of-the-art models.
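
The separation quality quoted above is reported as SDR (signal-to-distortion ratio) in dB. As a rough illustration of what that number measures, the sketch below computes the basic energy-ratio form, $10 \log_{10}(\lVert s \rVert^2 / \lVert s - \hat{s} \rVert^2)$, for a toy signal. This is an assumption-laden sketch: the function name `sdr_db`, the `eps` stabilizer, and the toy sine-wave data are illustrative only, and the paper's exact evaluation protocol (for example chunking or aggregation across tracks) is not described in this section.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB for two equal-length sample sequences.

    Hypothetical helper for illustration; eps guards against division by zero
    and log of zero when either energy term vanishes.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the true source signal.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual between the true source and the estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy data (assumption): a 440 Hz sine as the "true" source, and an estimate
# that is the same signal scaled to 90% amplitude.
reference = [math.sin(2 * math.pi * 440 * n / 44100) for n in range(4410)]
estimate = [0.9 * r for r in reference]

print(f"SDR: {sdr_db(reference, estimate):.2f} dB")
```

In this toy case the residual has one tenth of the reference amplitude, so the energy ratio is about 100 and the SDR comes out close to 20 dB. Higher values mean the estimate is closer to the true source.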

Paper Structure

This paper contains 23 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: A framework of Real-Time Single-Path TFC-TDF UNet when the number of layers is one. F is the number of features on the frequency axis; T is the number of features on the time axis; C is the number of channels of input spectrogram; g indicates the channel increment; L is the number of Single-Path modules; S is the number of target sources.
  • Figure 2: Medium TFC-TDF. F is the number of features on the frequency axis; T is the number of features on the time axis; C is the number of channels generated by the convolution layer; d is the size of the output features for the fully connected layer.