A Lightweight Architecture for Multi-instrument Transcription with Practical Optimizations

Ruigang Li, Yongxu Zhu

TL;DR

This work tackles practical automatic music transcription for mixtures with multiple timbres by introducing a lightweight two-branch architecture: a timbre-agnostic transcription backbone and a separate timbre-encoding branch that forms note-level timbre clusters via contrastive learning. Four key optimizations (EnergyNorm spectral normalization, two-octave dilated harmonic context, tuned focal loss, and InfoNCE-based clustering) deliver strong transcription and timbre separation with a small footprint suitable for low-resource deployment. The approach performs competitively against heavier baselines on both synthetic and real datasets, with ablations showing clear gains from the proposed components. The work also discusses limitations of synthetic data, explores learnable time–frequency representations, and outlines directions for improving robustness and scalability in real-world settings.

Abstract

Existing multi-timbre transcription models struggle with generalization beyond pre-trained instruments, rigid source-count constraints, and high computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model performs competitively with heavier baselines in transcription accuracy and separation quality and shows promising generalization, making it well suited to real-world deployment in resource-constrained settings.

Paper Structure

This paper contains 35 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed method. Top: Overall pipeline, which takes a multi-timbral mixture audio as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre-agnostic transcription outputs: the frame activation posteriorgram $Y_F$ and the onset activation posteriorgram $Y_O$. Bottom right: Timbre encoding branch yielding a $D$-dimensional timbre embedding $V_{ti}$ for each time–frequency bin, where $N=84$ denotes the target pitch range.
  • Figure 2: Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.
  • Figure 3: Result of frame-level and note-level post-processing for triple separation.
  • Figure 4: Randomly generated piano-roll (left) and the corresponding CQT spectrogram of the synthesized audio using a trumpet timbre (right).
  • Figure 5: t-SNE visualization of timbre embeddings. (a) Frame-level embeddings for BACH10 Piece 2; (b) note-level aggregates of (a); (c) frame-level for URMP Piece 18; (d) frame-level for URMP Piece 18 using top-$k$ attention.