Mel-RoFormer for Vocal Separation and Vocal Melody Transcription
Ju-Chiang Wang, Wei-Tsung Lu, Jitong Chen
TL;DR
Mel-RoFormer addresses the difficulty of modeling complex musical spectra by treating frequency and time as two separate sequences, processed by interleaved RoPE Transformers behind a learnable Mel-band front-end. Operating on complex spectrograms, it estimates complex masks to reconstruct the separated vocals. Training follows a two-stage pipeline: a vocal separation model is trained first and then fine-tuned for vocal melody transcription. The model combines a Mel-band Projection, RoFormer blocks, and an Embedding Projection to produce embeddings for downstream tasks; multi-resolution STFT losses guide separation, and an onset-frame approach guides transcription. Results on MUSDB18HQ, MIR-ST500, and POP909 demonstrate state-of-the-art performance, and the authors highlight Mel-RoFormer's potential as a versatile MIR foundation model for related tasks such as chord recognition and multi-instrument transcription.
Abstract
Developing a versatile deep neural network to model music audio is crucial in music information retrieval (MIR). This task is challenging due to the intricate spectral variations inherent in music signals, which convey melody, harmonics, and timbres of diverse instruments. In this paper, we introduce Mel-RoFormer, a spectrogram-based model featuring two key designs: a novel Mel-band Projection module at the front-end to enhance the model's capability to capture informative features across multiple frequency bands, and interleaved RoPE Transformers to explicitly model the frequency and time dimensions as two separate sequences. We apply Mel-RoFormer to two essential MIR tasks: vocal separation, which isolates singing voices from audio mixtures, and vocal melody transcription, which transcribes their lead melodies. Although both tasks focus on singing signals, they have distinct optimization objectives. Instead of training a unified model, we adopt a two-step approach: we first train a vocal separation model, which then serves as a foundation model that is fine-tuned for vocal melody transcription. Through extensive experiments on benchmark datasets, we show that our models achieve state-of-the-art performance in both vocal separation and melody transcription, underscoring the efficacy and versatility of Mel-RoFormer in modeling complex music audio signals.
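To make the "RoPE" ingredient concrete, the following is a minimal pure-Python sketch of rotary position embedding as it is commonly defined for RoFormer-style attention. It is an illustration, not the authors' implementation: the function names `rope` and `dot`, the base of 10000, and the toy vectors are all assumptions. The sketch shows the property that makes RoPE attractive for sequence modeling: the attention score between a rotated query and a rotated key depends only on their relative offset. In a model like the one described above, such a rotation would presumably be applied once along the time sequence and once along the frequency-band sequence.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles.

    `base` = 10000 is the usual RoFormer default and is an assumption here.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for i in range(0, dim, 2):
        # Pair j = i / 2 uses frequency base ** (-2j / dim) = base ** (-i / dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain inner product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values, for illustration only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# The same relative offset (3 steps) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# The two scores agree up to floating-point error.
print(abs(score_a - score_b) < 1e-9)
```

Because each feature pair is only rotated, the vector norm is unchanged and position 0 leaves the vector as it was; positional information enters the attention score purely through the angle difference between query and key.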
