DTT-BSR: GAN-based DTTNet with RoPE Transformer Enhancement for Music Source Restoration

Shihong Tan, Haoyu Wang, Youran Ni, Yingzhao Hou, Jiayue Luo, Zipei Hu, Han Dou, Zerui Han, Ningning Pan, Yuzhu Wang, Gongping Huang

TL;DR

Music source restoration requires both isolating individual sources and restoring degraded signals in mixed and mastered recordings. The authors introduce DTT-BSR, a GAN-based architecture that combines a DTTNet backbone with RoPE transformer blocks and a dual-path band-split RNN, providing long-range temporal modeling and multi-resolution spectral processing in an end-to-end framework. The model is trained with a composite loss and evaluated on MSRBench; it placed 3rd on the objective leaderboard and 4th on the subjective leaderboard of the ICASSP 2026 MSR Challenge with a compact 7.1-million-parameter model, and the code is publicly released. The work targets practical, high-fidelity music restoration, particularly improving non-vocal instrument separation while keeping training and inference efficient.

Abstract

Music source restoration (MSR) aims to recover unprocessed stems from mixed and mastered recordings. The challenge lies in both separating overlapping sources and reconstructing signals degraded by production effects such as compression and reverberation. We therefore propose DTT-BSR, a hybrid generative adversarial network (GAN) that combines a rotary positional embedding (RoPE) transformer for long-term temporal modeling with a dual-path band-split recurrent neural network (RNN) for multi-resolution spectral processing. Our model achieved 3rd place on the objective leaderboard and 4th place on the subjective leaderboard in the ICASSP 2026 MSR Challenge, demonstrating exceptional generation fidelity and semantic alignment at a compact size of 7.1M parameters.
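
To make the RoPE component mentioned in the abstract more concrete, the following is a minimal NumPy sketch of rotary positional embeddings in general. It is not the authors' implementation: the function name `rope`, the `base` value of 10000, and the half-split pairing of features are illustrative assumptions. It shows the property that makes RoPE useful for temporal modeling, namely that the dot product between a rotated query and a rotated key depends only on their relative position.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Hypothetical helper for illustration; dim must be even.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One rotation frequency per feature pair (assumed geometric schedule).
    freqs = base ** (-np.arange(half) / half)
    # Angle for every (position, frequency) combination: shape (seq_len, half).
    angles = np.arange(seq_len)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Same query and key content placed at every position of a short sequence.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)
Q = np.tile(q, (6, 1))
K = np.tile(k, (6, 1))

# Attention scores between rotated queries and rotated keys.
scores = rope(Q) @ rope(K).T
```

In this sketch, `scores[2, 1]` and `scores[5, 4]` agree up to floating-point error because both pairs are one position apart, which is the relative-position behavior a RoPE transformer relies on for long sequences.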

Paper Structure

This paper contains 7 sections, 1 equation, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Our Proposed Model Architecture