Annotation-free Automatic Music Transcription with Scalable Synthetic Data and Adversarial Domain Confusion

Gakusei Sato, Taketo Akama

TL;DR

The paper tackles the challenge of automatic music transcription in low-resource domains by removing the dependency on MIDI–audio paired data. It introduces a framework that pre-trains on scalable synthetic audio generated from MIDI and one-shot timbres, followed by adversarial domain confusion to align synthetic and real audio representations using unannotated data. A Transformer-based transcription model converts mel-spectrogram inputs into MIDI-like tokens, with a discriminator guiding domain-invariant features during fine-tuning. Across multiple datasets, the proposed Synthetic-DC approach achieves competitive results in a target-annotation-free setting and reveals instrument-specific effects of timbre and MIDI variation on transcription performance, suggesting a viable path toward more general AMT systems without costly MIDI annotations.

Abstract

Automatic Music Transcription (AMT) is a vital technology in music information processing. Although machine learning has recently improved performance, current methods typically attain high accuracy only in domains where abundant annotated data is available, and low- or no-resource domains remain an unresolved challenge. To address this, we propose a transcription model that requires no MIDI-audio paired data: it uses scalable synthetic audio for pre-training and adversarial domain confusion with unannotated real audio. In experiments, we evaluate methods under a real-world application scenario in which the training datasets contain no MIDI annotations for audio in the target domain. Our proposed method achieves competitive performance relative to established baseline methods despite using no real paired MIDI-audio datasets. Ablation studies additionally provide insights into the scalability of this approach and the remaining challenges in AMT research.
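
To make the adversarial domain confusion idea more concrete, the following is a minimal sketch of the two competing objectives, not the paper's implementation. The summary above does not state the exact loss, so the sketch assumes one common formulation: the discriminator is trained with binary cross-entropy to tell synthetic from real features, while the encoder is trained to push the discriminator's output toward 0.5 for every input. The function names (`discriminator_loss`, `confusion_loss`) and the label convention (synthetic = 0, real = 1) are hypothetical choices made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce(p: float, target: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a predicted probability and a (soft) target."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(logits_synth, logits_real):
    """Discriminator objective (assumed): synthetic features -> 0, real features -> 1."""
    losses = [bce(sigmoid(z), 0.0) for z in logits_synth]
    losses += [bce(sigmoid(z), 1.0) for z in logits_real]
    return sum(losses) / len(losses)


def confusion_loss(logits_synth, logits_real):
    """Encoder objective (assumed): make the discriminator output 0.5 everywhere,
    i.e. cross-entropy against a uniform target over the two domains."""
    logits = list(logits_synth) + list(logits_real)
    return sum(bce(sigmoid(z), 0.5) for z in logits) / len(logits)


# Discriminator logits for a batch of synthetic and real encoder features.
separable = ([-5.0, -4.0], [5.0, 4.0])   # domains easy to tell apart
confused = ([0.0, 0.0], [0.0, 0.0])      # discriminator cannot tell them apart

print(discriminator_loss(*separable), confusion_loss(*separable))
print(discriminator_loss(*confused), confusion_loss(*confused))
```

Under this assumed formulation, the confusion loss reaches its minimum of log 2 exactly when the discriminator is maximally uncertain, which is the sense in which the encoder's features become domain-invariant. In practice the two losses would be minimized in alternation, with the discriminator updated on one and the encoder on the other.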

Paper Structure (15 sections, 1 equation, 1 figure, 1 table)