Lead Instrument Detection from Multitrack Music
Longshen Ou, Yu Takahashi, Ye Wang
TL;DR
This work introduces lead instrument detection in multitrack music to address the limitations of mixture-based analysis. It contributes two expert-annotated datasets and a model that applies a shared SSL encoder to every track, followed by track-wise attention. With instrument and track embeddings and track permutation augmentation, the approach outperforms SVM and CRNN baselines and generalizes to unseen instruments and out-of-domain data. The authors provide extensive ablations, cross-dataset evaluations, and insights into data design for multitrack audio tasks. The results suggest practical applications in audio mixing, structure analysis, and music recommendation in complex multitrack scenarios.
Abstract
Prior approaches to lead instrument detection primarily analyze mixture audio; they are limited to coarse classifications and generalize poorly. This paper presents an approach to lead instrument detection in multitrack music audio, built on expertly annotated datasets and a framework that integrates a self-supervised learning (SSL) model with a track-wise, frame-level attention-based classifier. The attention mechanism dynamically extracts and aggregates track-specific features based on their auditory importance, enabling precise detection across varied instrument types and combinations. Enhanced by track classification and permutation augmentation, our model substantially outperforms existing SVM and CRNN models and remains robust on unseen instruments and in out-of-domain testing. We believe our exploration provides valuable insights for future research on audio content analysis in multitrack music settings.
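To make the two core ideas concrete, the following is a minimal sketch of track-wise, frame-level attention and of track permutation augmentation. It is an illustration under stated assumptions, not the authors' implementation: the data layout `features[track][frame]`, the dot-product scoring against a single `query` vector, and the function names `trackwise_attention` and `permute_tracks` are all hypothetical. In the paper the per-track features come from a shared SSL encoder; here they are plain lists of floats.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def trackwise_attention(features, query):
    """Attend over tracks independently at every frame.

    features[t][f] is the feature vector of track t at frame f
    (assumed layout). query is an assumed scoring vector; a real
    model would learn it. Returns, per frame, the attention weights
    over tracks and the weighted sum of the track features.
    """
    n_tracks = len(features)
    n_frames = len(features[0])
    dim = len(query)
    weights, pooled = [], []
    for f in range(n_frames):
        # Dot-product score of each track's frame feature against the query.
        scores = [
            sum(q * x for q, x in zip(query, features[t][f]))
            for t in range(n_tracks)
        ]
        w = softmax(scores)
        weights.append(w)
        # Aggregate the track features using the attention weights.
        pooled.append([
            sum(w[t] * features[t][f][d] for t in range(n_tracks))
            for d in range(dim)
        ])
    return weights, pooled


def permute_tracks(features, lead_labels, rng):
    """Shuffle track order and remap frame-level lead-track indices.

    lead_labels[f] is the index of the lead track at frame f
    (assumed label format).
    """
    order = list(range(len(features)))
    rng.shuffle(order)  # order[new_position] = old_position
    new_index = {old: new for new, old in enumerate(order)}
    permuted = [features[old] for old in order]
    new_labels = [new_index[label] for label in lead_labels]
    return permuted, new_labels


# Toy input: 3 tracks, 2 frames, 2-dimensional features.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [2.0, 0.0]],
    [[0.5, 0.5], [0.0, 0.0]],
]
weights, pooled = trackwise_attention(features, query=[1.0, 0.0])
# In this sketch, the most-attended track per frame is the predicted lead.
predicted_lead = [max(range(len(w)), key=w.__getitem__) for w in weights]
permuted, permuted_labels = permute_tracks(
    features, predicted_lead, random.Random(0)
)
```

Because the attention weights are computed per frame, the predicted lead can change from one frame to the next. The permutation step keeps each label pointing at the same audio after shuffling, so a model trained on permuted copies cannot rely on a fixed track position.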
