
Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

Frank Cwitkowitz, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama, Wei-Hsiang Liao, Yuki Mitsufuji

TL;DR

Timbre-Trap is a low-resource, instrument-agnostic approach to multi-pitch estimation that unifies transcription and audio reconstruction in a single autoencoder, with a simple switch selecting between the two outputs. The method uses invertible complex constant-Q transform (CQT) features to tie pitch salience to the reconstructed spectral coefficients, so the same model can both transcribe and synthesize spectra that correspond to pitch salience. Training combines transcription, reconstruction, and consistency objectives, which lets the model learn from limited annotated data while preserving timbre information. Empirically, the framework is competitive with state-of-the-art methods under data-scarce conditions, demonstrating that reconstruction-guided transcription is feasible and suggesting potential for future disentanglement and semi-supervised extensions.

Abstract

In recent years, research on music transcription has focused mainly on architecture design and instrument-specific data acquisition. Because diverse datasets are scarce, progress is often limited to solo-instrument tasks such as piano transcription. Several works have explored multi-instrument transcription as a means to bolster the performance of models on low-resource tasks, but these methods face the same data availability issues. We propose Timbre-Trap, a novel framework which unifies music transcription and audio reconstruction by exploiting the strong separability between pitch and timbre. We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients, selecting between the two outputs during the decoding stage via a simple switch mechanism. In this way, the model learns to produce coefficients corresponding to timbre-less audio, which can be interpreted as pitch salience. We demonstrate that the framework achieves performance comparable to state-of-the-art instrument-agnostic transcription methods while requiring only a small amount of annotated data.
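The switch mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the class name `ToySwitchAutoencoder`, the linear layers, the layer sizes, the choice to append the switch value to the latent vector, and the use of coefficient magnitude as salience are all assumptions made for illustration. It only shows the idea of one encoder and one decoder whose output mode is chosen by a binary flag.

```python
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToySwitchAutoencoder:
    """Hypothetical linear autoencoder with a binary output switch."""

    def __init__(self, n_bins=8, n_latent=4, seed=0):
        rng = random.Random(seed)
        self.n_bins = n_bins
        # One complex CQT frame is stored as real parts followed by imaginary parts.
        self.encoder = make_matrix(n_latent, 2 * n_bins, rng)
        # Assumption: the decoder sees the latent plus one extra switch value.
        self.decoder = make_matrix(2 * n_bins, n_latent + 1, rng)

    def forward(self, frame, transcribe):
        latent = matvec(self.encoder, frame)
        switch = 1.0 if transcribe else 0.0
        coefficients = matvec(self.decoder, latent + [switch])
        if not transcribe:
            # Reconstruction mode: return the complex coefficients as-is.
            return coefficients
        # Transcription mode (assumption): read the magnitude of each
        # output coefficient as the salience of that frequency bin.
        real = coefficients[: self.n_bins]
        imag = coefficients[self.n_bins :]
        return [(r * r + i * i) ** 0.5 for r, i in zip(real, imag)]


model = ToySwitchAutoencoder()
frame = [0.5] * (2 * model.n_bins)  # placeholder input, not real audio features
recon = model.forward(frame, transcribe=False)
salience = model.forward(frame, transcribe=True)
```

Both calls share the same encoder and decoder weights; only the flag changes, which is what allows a single model to be trained on reconstruction and transcription objectives together.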

Paper Structure

This paper contains 12 sections, 4 equations, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Proposed reconstruction-guided transcription framework. Audio is transformed into complex CQT coefficients, which are fed into an autoencoder to produce either reconstructed CQT coefficients or an estimated pitch salience, based on a binary switch.
  • Figure 2: t-SNE (van der Maaten and Hinton, 2008) visualization of the average latent across each monophonic stem within Bach10, colored by associated instrument.