Source Separation & Automatic Transcription for Music

Bradford Derby, Lucas Dunker, Samarth Galchar, Shashank Jarmale, Akash Setti

TL;DR

The paper tackles the problem of automatically deriving sheet music from audio by combining source separation and automatic transcription in an end-to-end pipeline. It uses spectrogram masking and deep networks to separate a vocal stem from mixtures, augments training data with on-the-fly mixing on MUSDB18, and employs a MAESTRO-based AMT module to produce binary piano rolls that are converted to MIDI and then to sheet music via MuseScore. Key contributions include a detailed integration of STFT/CQT-based processing, a vocal-separation model with an LSTM backbone, an AMT model trained on MAESTRO with focal loss and active-frame metrics, and a MuseScore-driven transcription-to-sheet-music path. The work demonstrates the feasibility of a complete pipeline from audio to printable notation and discusses practical limitations such as dataset-domain mismatch and computational constraints, outlining concrete future improvements. Overall, the pipeline lays the groundwork for interactive music production and analysis workflows by enabling automated stem separation, transcription, and notation generation.
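The focal loss mentioned above can be illustrated with a minimal sketch, assuming the standard binary formulation applied independently to each cell of a binary piano roll. The function names and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions, not values taken from the paper.

```python
import math


def binary_focal_loss(prob, label, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one predicted probability and one 0/1 label.

    gamma and alpha are assumed defaults, not the paper's settings.
    """
    p = min(max(prob, eps), 1.0 - eps)          # clamp to keep log() finite
    p_t = p if label == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of cells that are already well classified
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def piano_roll_focal_loss(pred_roll, target_roll, gamma=2.0, alpha=0.25):
    """Mean focal loss over a piano roll given as nested lists (pitch x frame)."""
    losses = [
        binary_focal_loss(p, y, gamma, alpha)
        for pred_row, target_row in zip(pred_roll, target_roll)
        for p, y in zip(pred_row, target_row)
    ]
    return sum(losses) / len(losses)


# Hypothetical 2-pitch x 3-frame example: most cells are silent (label 0).
target = [[0, 0, 1], [0, 0, 0]]
confident = [[0.05, 0.05, 0.9], [0.05, 0.05, 0.05]]
uncertain = [[0.4, 0.4, 0.6], [0.4, 0.4, 0.4]]
loss_confident = piano_roll_focal_loss(confident, target)
loss_uncertain = piano_roll_focal_loss(uncertain, target)
```

Because most piano-roll cells are silent, a loss of this shape keeps the many easy negative cells from dominating training; with `gamma=0` it reduces to a class-weighted cross-entropy.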

Abstract

Source separation is the process of isolating individual sounds in an auditory mixture of multiple sounds [1], and has a variety of applications ranging from speech enhancement and lyric transcription [2] to digital audio production for music. Automatic Music Transcription (AMT), in turn, is the process of converting raw music audio into sheet music that musicians can read [3]. Historically, these tasks have faced challenges such as significant audio noise, long training times, and a lack of free-use data due to copyright restrictions. However, recent developments in deep learning have brought promising new approaches to building low-distortion stems and generating sheet music from audio signals [4]. Using spectrogram masking, deep neural networks, and the MuseScore API, we attempt to create an end-to-end pipeline that allows an initial music audio mixture (e.g., a .wav file) to be separated into instrument stems, converted into MIDI files, and transcribed into sheet music for each component instrument.
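As a rough illustration of spectrogram masking, the sketch below builds a soft ratio mask from toy magnitude spectrograms and applies it to the mixture. It assumes the mixture magnitude is approximately the sum of the source magnitudes (phase is ignored), uses plain Python lists so it runs without dependencies, and all names and values are hypothetical. In the pipeline described here the mask would be predicted by a network, not computed from the true stems.

```python
def ratio_mask(target_mag, residual_mag, eps=1e-8):
    """Soft mask in [0, 1]: the target's share of each time-frequency bin."""
    return [
        [t / (t + r + eps) for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(target_mag, residual_mag)
    ]


def apply_mask(mixture_mag, mask):
    """Element-wise product; every output bin is at most the mixture bin."""
    return [
        [m * w for m, w in zip(m_row, w_row)]
        for m_row, w_row in zip(mixture_mag, mask)
    ]


# Toy magnitudes: 2 frequency bins x 3 time frames (hypothetical values).
vocals = [[0.9, 0.0, 0.5], [0.1, 0.0, 0.5]]
accompaniment = [[0.1, 1.0, 0.5], [0.9, 2.0, 0.5]]
mixture = [
    [v + a for v, a in zip(v_row, a_row)]
    for v_row, a_row in zip(vocals, accompaniment)
]

mask = ratio_mask(vocals, accompaniment)
vocal_estimate = apply_mask(mixture, mask)
```

With real audio, the magnitudes would come from an STFT of the waveform, and the masked magnitude would be combined with the mixture phase before inverting back to a time-domain stem.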

Paper Structure

This paper contains 28 sections, 5 equations, 8 figures, 3 tables, 2 algorithms.

Figures (8)

  • Figure 1: The components of the MUSDB18 dataset [5]
  • Figure 2: The process of computing a short-time Fourier transform of a waveform [6]
  • Figure 3: Applying a mask to a spectrogram signal creates a new signal which is a subset of the original [9]
  • Figure 4: Our evaluation metric results for source separation
  • Figure 5: Visualization of a spectrogram using Constant-Q Transform (CQT)
  • ...and 3 more figures