Mel-RoFormer for Vocal Separation and Vocal Melody Transcription

Ju-Chiang Wang, Wei-Tsung Lu, Jitong Chen

TL;DR

Mel-RoFormer tackles the challenge of modeling complex musical spectra by explicitly modeling frequency and time as separate sequences using interleaved RoPE Transformers and a learnable Mel-band front-end. Operating on complex spectrograms, it estimates complex masks and reconstructs separated voices, while leveraging a two-stage pipeline to achieve high performance on vocal separation and vocal melody transcription. The model combines a Mel-band Projection, RoFormer blocks, and an Embedding Projection to produce robust embeddings for downstream tasks, with multi-resolution STFT losses guiding separation and an onset-frame approach guiding transcription. Results on MUSDB18HQ, MIR-ST500, and POP909 demonstrate state-of-the-art performance, and the authors highlight Mel-RoFormer’s potential as a versatile MIR foundation model for related tasks such as chord recognition and multi-instrument transcription.

Abstract

Developing a versatile deep neural network to model music audio is crucial in MIR. This task is challenging due to the intricate spectral variations inherent in music signals, which convey melody, harmonics, and timbres of diverse instruments. In this paper, we introduce Mel-RoFormer, a spectrogram-based model featuring two key designs: a novel Mel-band Projection module at the front-end to enhance the model's capability to capture informative features across multiple frequency bands, and interleaved RoPE Transformers to explicitly model the frequency and time dimensions as two separate sequences. We apply Mel-RoFormer to tackle two essential MIR tasks: vocal separation and vocal melody transcription, aimed at isolating singing voices from audio mixtures and transcribing their lead melodies, respectively. Despite their shared focus on singing signals, these tasks possess distinct optimization objectives. Instead of training a unified model, we adopt a two-step approach. Initially, we train a vocal separation model, which subsequently serves as a foundation model for fine-tuning for vocal melody transcription. Through extensive experiments conducted on benchmark datasets, we showcase that our models achieve state-of-the-art performance in both vocal separation and melody transcription tasks, underscoring the efficacy and versatility of Mel-RoFormer in modeling complex music audio signals.

Paper Structure

This paper contains 16 sections, 3 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The diagram of Mel-RoFormer, which consists of three major modules: Mel-band Projection, RoFormer Blocks, and Embedding Projection. The input is a Complex Spectrogram, and the output is an Embedding tensor, which can be rearranged into the desired shape.
  • Figure 2: Illustration of a Mel filter-bank with 7 bands. In this example, there are 1024 frequency bins. The frequency bins from 1 to 46 are encompassed by the 0-th Mel-band (i.e., $\mathcal{F}_0$), those from 24 to 77 by the 1st Mel-band (i.e., $\mathcal{F}_1$), and so forth.
  • Figure :
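
The overlapping band index sets described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mel_band_bins`, the HTK-style Mel formula, the triangular-filter convention, and the 44.1 kHz sample rate are all assumptions, so the bin ranges it produces will not necessarily match the 1 to 46 and 24 to 77 ranges quoted in the caption.

```python
import math


def hz_to_mel(freq_hz):
    # HTK-style Mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_bins(num_bands, num_bins, sample_rate):
    """Hypothetical helper: for each Mel band k, return the list of
    frequency-bin indices (the set F_k) that fall strictly inside the
    support of the k-th triangular Mel filter."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # num_bands + 2 edge points, equally spaced on the Mel scale.
    edges_hz = [
        mel_to_hz(mel_max * i / (num_bands + 1)) for i in range(num_bands + 2)
    ]
    # Center frequency of each linear frequency bin.
    bin_hz = [nyquist * b / (num_bins - 1) for b in range(num_bins)]
    bands = []
    for k in range(num_bands):
        # Triangle k spans edge k to edge k + 2, so neighbours overlap.
        low, high = edges_hz[k], edges_hz[k + 2]
        bands.append([b for b, f in enumerate(bin_hz) if low < f < high])
    return bands


bands = mel_band_bins(num_bands=7, num_bins=1024, sample_rate=44100)
for k, bins in enumerate(bands):
    print(f"F_{k}: bins {bins[0]} to {bins[-1]} ({len(bins)} bins)")
```

Because each triangle shares half of its support with its neighbour, adjacent sets $\mathcal{F}_k$ and $\mathcal{F}_{k+1}$ overlap, which is the property the caption illustrates.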