Improving Real-Time Music Accompaniment Separation with MMDenseNet

Chun-Hsiang Wang, Chung-Che Wang, Jun-You Wang, Jyh-Shing Roger Jang, Yen-Hsun Chu

TL;DR

This work targets real-time, low-latency accompaniment separation on edge devices by enhancing the lightweight MMDenseNet. It introduces four approaches (complex ideal ratio mask, self-attention, band-merge-split, and feature look back) to improve separation quality while keeping runtime suitable for real-time use. On MUSDB18, the methods raise SDR from about 11.2 for the baseline to around 15.0 in certain configurations, with real-time factor and latency kept low enough for practical edge deployment. The results demonstrate a viable path for real-time karaoke and similar applications, and suggest further exploration of additional subbands and diverse source types.

Abstract

Music source separation aims to separate polyphonic music into different types of sources. Most existing methods focus on enhancing the quality of separated results by using a larger model structure, rendering them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. Therefore, the goal of this paper is to enhance a lightweight model, MMDenseNet, to strike a balance between separation quality and latency for real-time applications. Several directions of improvement are explored or proposed in this paper, including the complex ideal ratio mask, self-attention, the band-merge-split method, and feature look back. Source-to-distortion ratio, real-time factor, and optimal latency are used to evaluate performance. To align with our application requirements, the evaluation in this paper focuses on the separation performance of the accompaniment part. Experimental results demonstrate that our improvements achieve a low real-time factor and optimal latency while maintaining acceptable separation quality.
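
As a rough illustration of two of the evaluation metrics named above, the following sketch computes a plain energy-ratio source-to-distortion ratio (SDR) and a real-time factor (RTF). It assumes the common definitions SDR = 10 log10(||s||^2 / ||s - s_hat||^2) and RTF = processing time / audio duration; the paper's exact evaluation toolkit and settings are not stated in this section and may differ. The function names `sdr_db` and `real_time_factor` are hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB between a reference and an estimated signal.

    Assumption: simple definition 10*log10(||s||^2 / ||s - s_hat||^2);
    not necessarily the exact variant used in the paper.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the processed audio.

    Values below 1 mean the audio is processed faster than it plays.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    accompaniment = [1.0, -1.0, 1.0, -1.0]
    separated = [0.9, -0.9, 0.9, -0.9]  # estimate with a 10% amplitude error
    print(f"SDR: {sdr_db(accompaniment, separated):.2f} dB")
    print(f"RTF: {real_time_factor(0.5, 2.0):.2f}")
```

In this toy case the estimate is off by 10% in amplitude, so the error energy is 1/100 of the signal energy and the SDR is about 20 dB; processing 2 s of audio in 0.5 s gives an RTF of 0.25.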

Paper Structure (12 sections, 1 equation, 8 figures, 1 table)

Figures (8)

  • Figure 1: Structure of the original MMDenseNet takahashi2017multi.
  • Figure 2: The modified MMDenseNet which uses cIRM as the new output form. $\hat{M}_{mag}$, $\hat{Q}$, $\hat{P}_{r}$ and $\hat{P}_{i}$ are respectively magnitude mask estimation, magnitude estimation, phase estimation of real part, and phase estimation of imaginary part. $F$ and $N$ are respectively 1,025 and 2, and $T$ varies in different experiments in this paper.
  • Figure 3: The adjusted structure of the self-attention along the time axis. $E$ and $C'$ are respectively set to 20 and 5 in this paper. "PW" stands for "pointwise".
  • Figure 4: The adjusted structure of the self-attention along the frequency axis. $N$ is $T/t$, where $t$ is 16. "PW" stands for "pointwise".
  • Figure 5: When using self-attention along both the time and frequency axes for fullband MDenseNet, the self-attention along the frequency axis is applied before the self-attention along the time axis. SA: self-attention.
  • ...and 3 more figures
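
The complex ideal ratio mask (cIRM) mentioned in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the standard definition, in which the mask for each time-frequency bin is the complex ratio of the target spectrogram to the mixture spectrogram, so that multiplying the mixture by the mask recovers both magnitude and phase. This shows only the ideal mask, not the network's four-output parameterization ($\hat{M}_{mag}$, $\hat{Q}$, $\hat{P}_{r}$, $\hat{P}_{i}$), whose combination rule is not given in this section. The function names are hypothetical.

```python
def ideal_complex_ratio_mask(target_bins, mixture_bins, eps=1e-8):
    """Per-bin cIRM M = S / Y, computed as S * conj(Y) / (|Y|^2 + eps).

    target_bins and mixture_bins are equal-length sequences of complex
    STFT values; eps avoids division by zero in silent mixture bins.
    """
    if len(target_bins) != len(mixture_bins):
        raise ValueError("inputs must have the same number of bins")
    return [
        s * y.conjugate() / (abs(y) ** 2 + eps)
        for s, y in zip(target_bins, mixture_bins)
    ]


def apply_complex_mask(mask, mixture_bins):
    """Estimate the target bins by complex multiplication: S_hat = M * Y."""
    return [m * y for m, y in zip(mask, mixture_bins)]


if __name__ == "__main__":
    # Toy complex STFT bins (illustrative values, not data from the paper).
    mixture = [1 + 1j, 2 - 1j, -0.5 + 0.25j]
    accompaniment = [0.5 + 1j, 1 - 1j, -0.5 + 0j]

    mask = ideal_complex_ratio_mask(accompaniment, mixture)
    estimate = apply_complex_mask(mask, mixture)
    max_error = max(abs(e - s) for e, s in zip(estimate, accompaniment))
    print(f"max reconstruction error: {max_error:.2e}")
```

Unlike a real-valued magnitude mask, which only rescales each bin and reuses the mixture phase, a complex mask can also rotate the phase, which is why the ideal mask reconstructs the toy target here up to the small `eps` term.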