A Lightweight Architecture for Multi-instrument Transcription with Practical Optimizations
Ruigang Li, Yongxu Zhu
TL;DR
This work tackles practical automatic music transcription of multi-timbre mixtures with a lightweight two-branch architecture: a timbre-agnostic transcription backbone and a separate timbre-encoding branch that groups notes into timbre clusters via contrastive learning. Four optimizations (EnergyNorm spectral normalization, two-octave dilated harmonic context, a tuned focal loss, and InfoNCE-based clustering) deliver strong transcription and timbre separation in a footprint small enough for low-resource deployment. The model is competitive with heavier baselines on both synthetic and real datasets, and ablations show clear gains from the proposed components. The work also discusses the limitations of synthetic data, explores learnable time–frequency representations, and outlines directions for improving robustness and scalability in real-world settings.
Abstract
Existing multi-timbre transcription models generalize poorly beyond the instruments they were trained on, impose rigid constraints on the number of sources, and carry computational costs that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model is competitive with heavier baselines in transcription accuracy and separation quality and shows promising generalization, making it well suited to deployment in real-world, resource-constrained settings.
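To make the note-level contrastive clustering idea concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss over note embeddings. It is an illustration under stated assumptions, not the formulation used in this work: the function name `note_info_nce`, the supervised choice of positives (notes that share an instrument label), the cosine similarity, and the temperature of 0.1 are all hypothetical.

```python
import numpy as np


def note_info_nce(embeddings, labels, temperature=0.1):
    """InfoNCE-style loss over note embeddings (hypothetical formulation).

    embeddings: (n_notes, dim) array, one vector per transcribed note.
    labels:     (n_notes,) instrument id per note, used only to pick positives.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    self_mask = np.eye(n, dtype=bool)
    # Positives: other notes with the same instrument label.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    has_pos = pos_mask.any(axis=1)
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast

    # L2-normalize so dot products are cosine similarities.
    norms = np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    z = z / norms
    sim = z @ z.T / temperature

    # Exclude each note's similarity to itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )

    # Mean log-probability of the positives for each anchor that has any.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[has_pos] / pos_mask[has_pos].sum(axis=1)
    return float(per_anchor.mean())


# Toy example: two tight groups of note embeddings.
notes = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.01, 0.99]])
loss_matched = note_info_nce(notes, [0, 0, 1, 1])   # labels agree with geometry
loss_mismatched = note_info_nce(notes, [0, 1, 0, 1])  # labels cut across groups
```

In this toy case the loss is lower when the labels agree with the geometric grouping of the embeddings, which is the pressure that pulls notes of the same timbre together. At inference time, when no labels exist, the learned embeddings would be grouped by a separate clustering step that this sketch does not cover.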
