Moises-Light: Resource-efficient Band-split U-Net For Music Source Separation
Yun-Ning Hung, Igor Pereira, Filip Korzeniowski
TL;DR
Moises-Light demonstrates that a carefully engineered lightweight model (about 5 million parameters per stem) can achieve competitive music source separation (MSS) on MUSDB-HQ when trained with strong data augmentation and a multi-resolution loss. By combining RoPE transformer-based sequence modeling, efficient band-splitting, and an SCNet-inspired encoder–decoder design within a DTTNet backbone, it delivers significant performance gains over prior lightweight approaches while keeping a small parameter footprint. Ablation studies quantify the contributions of RoPE, band-splitting, and the training strategies, and the model scales well when trained with additional data from MoisesDB. The work highlights the practical viability of efficient MSS models for edge devices and real-time settings, while noting limitations in drum separation and bass modeling that warrant further investigation.
Abstract
In recent years, significant advances have been made in music source separation, with model architectures based on dual-path modeling, band-split modules, or transformer layers achieving similarly strong results. However, these models often contain a large number of parameters, which makes them difficult to train and deploy on devices with limited computational resources. While some lightweight models have been introduced, they generally perform worse than their larger counterparts. In this paper, we take inspiration from these recent advances to improve a lightweight model. We demonstrate that, with careful design, a lightweight model can achieve signal-to-distortion ratios (SDRs) comparable to those of models with up to 13 times more parameters. Our proposed model, Moises-Light, achieves competitive results in separating four musical stems on the MUSDB-HQ benchmark dataset. The proposed model also demonstrates competitive scalability when using MoisesDB as additional training data.
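For readers unfamiliar with the metric, the sketch below illustrates the basic SDR definition, 10·log10(‖s‖² / ‖s − ŝ‖²), for a reference stem s and an estimate ŝ. It is a minimal illustration only: the function name `sdr_db`, the `eps` stabilizer, and the toy signals are assumptions made here, and the evaluation protocol actually used for MUSDB-HQ results (for example, chunk-wise aggregation) may differ from this plain whole-signal form.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB for two equal-length sample sequences.

    Illustrative whole-signal form; eps (an assumption) avoids division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Energy of the clean reference stem.
    signal_energy = sum(r * r for r in reference)
    # Energy of the difference between reference and estimate.
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: the estimate is the reference scaled by 0.9.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))
```

A higher value means the estimate is closer to the reference; in this toy case the error energy is one hundredth of the signal energy, which corresponds to roughly 20 dB.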
