Moises-Light: Resource-efficient Band-split U-Net For Music Source Separation

Yun-Ning Hung, Igor Pereira, Filip Korzeniowski

TL;DR

Moises-Light demonstrates that a carefully engineered lightweight model (roughly 5 million parameters per stem) can achieve competitive music source separation on MUSDB-HQ when trained with strong data augmentation and a multi-resolution loss. By combining RoPE transformer-based sequence modeling, efficient band-splitting, and an SCNet-inspired encoder–decoder design within a DTTNet backbone, it delivers significant performance gains over prior lightweight approaches while keeping a small parameter footprint. Ablation studies quantify the contributions of RoPE, band-splitting, and the training strategies, and the model scales well when trained with additional data from MoisesDB. The work highlights the practical viability of efficient music source separation (MSS) models for edge devices and real-time settings, while noting limitations in drum separation and bass modeling that warrant further investigation.

Abstract

In recent years, significant advances have been made in music source separation, with model architectures such as dual-path modeling, band-split modules, or transformer layers achieving comparably good results. However, these models often contain a significant number of parameters, posing challenges to devices with limited computational resources in terms of training and practical application. While some lightweight models have been introduced, they generally perform worse compared to their larger counterparts. In this paper, we take inspiration from these recent advances to improve a lightweight model. We demonstrate that with careful design, a lightweight model can achieve comparable SDRs to models with up to 13 times more parameters. Our proposed model, Moises-Light, achieves competitive results in separating four musical stems on the MUSDB-HQ benchmark dataset. The proposed model also demonstrates competitive scalability when using MoisesDB as additional training data.
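For readers unfamiliar with the metric, the SDR (signal-to-distortion ratio) mentioned above can be illustrated with its basic definition: the energy of the reference stem divided by the energy of the estimation error, in decibels. The sketch below is only an illustration under that assumption; the function name `sdr_db` and the `eps` stabilizer are hypothetical, and this section does not state which SDR variant or evaluation toolkit the paper uses.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Basic signal-to-distortion ratio in dB for two equal-length sample sequences.

    Assumption: plain energy-ratio SDR; the paper's exact evaluation
    protocol is not given in this section.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the ground-truth stem.
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual between the ground truth and the separated stem.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent references or perfect estimates.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
```

A higher value means the estimated stem is closer to the reference; for example, an estimate that is a uniformly 10% quieter copy of the reference scores about 20 dB under this definition.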

Paper Structure

This paper contains 15 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The overall architecture of our proposed model.