Lead Instrument Detection from Multitrack Music

Longshen Ou, Yu Takahashi, Ye Wang

TL;DR

This work introduces lead instrument detection in multitrack music, addressing the limitations of mixture-based analyses by leveraging two expert-annotated datasets and a novel model that uses a shared SSL encoder across tracks coupled with track-wise attention. The approach, enhanced by instrument and track embeddings as well as track permutation augmentation, outperforms traditional SVM and CRNN baselines and generalizes to unseen instruments and out-of-domain data. The authors provide extensive ablations, cross-dataset evaluations, and insights into data design for multitrack audio tasks. The results suggest practical impact for audio mixing, structure analysis, and music recommendation in complex multitrack scenarios.

Abstract

Prior approaches to lead instrument detection primarily analyze mixture audio, which limits them to coarse classifications and weak generalization. This paper presents a novel approach to lead instrument detection in multitrack music audio by crafting expertly annotated datasets and designing a framework that integrates a self-supervised learning model with a track-wise, frame-level attention-based classifier. This attention mechanism dynamically extracts and aggregates track-specific features based on their auditory importance, enabling precise detection across varied instrument types and combinations. Enhanced by track classification and permutation augmentation, our model substantially outperforms existing SVM and CRNN models and remains robust on unseen instruments and in out-of-domain testing. We believe our exploration provides valuable insights for future research on audio content analysis in multitrack music settings.
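
To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch of track-wise, frame-level attention and of track permutation augmentation. It is an illustration under stated assumptions, not the authors' implementation: the dot-product scoring against a single `query` vector, the function names, and the representation of lead labels as per-frame track indices are all hypothetical, and the features stand in for the output of a shared SSL encoder that is not modeled here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def trackwise_attention(features, query):
    """Weight tracks against each other at every frame and pool them.

    features[t][f] is the feature vector of track t at frame f
    (assumed to come from an encoder shared by all tracks).
    query is a hypothetical learned scoring vector of the same size.
    Returns (weights, pooled) with weights[f][t] and pooled[f][d].
    """
    n_tracks = len(features)
    n_frames = len(features[0])
    dim = len(query)
    weights, pooled = [], []
    for f in range(n_frames):
        # One scalar score per track for this frame.
        scores = [
            sum(q * x for q, x in zip(query, features[t][f]))
            for t in range(n_tracks)
        ]
        # Softmax across tracks, so the weights of a frame sum to 1.
        w = softmax(scores)
        weights.append(w)
        # Attention-weighted sum of the track features.
        pooled.append([
            sum(w[t] * features[t][f][d] for t in range(n_tracks))
            for d in range(dim)
        ])
    return weights, pooled


def permute_tracks(features, lead_labels, rng):
    """Shuffle track order and remap per-frame lead-track indices."""
    order = list(range(len(features)))  # order[new_pos] = old_index
    rng.shuffle(order)
    new_features = [features[old] for old in order]
    old_to_new = {old: new for new, old in enumerate(order)}
    new_labels = [old_to_new[label] for label in lead_labels]
    return new_features, new_labels


# Toy input: 2 tracks, 2 frames, 2-dimensional features.
features = [
    [[2.0, 0.0], [0.0, 1.0]],  # track 0
    [[0.0, 0.0], [3.0, 1.0]],  # track 1
]
query = [1.0, 0.0]
weights, pooled = trackwise_attention(features, query)

rng = random.Random(0)
aug_features, aug_labels = permute_tracks(features, [0, 1], rng)
```

In this toy input, track 0 receives the larger weight at frame 0 and track 1 at frame 1, because the weights are normalized across tracks separately for each frame. The permutation step keeps each label pointing at the same audio after shuffling, which is what lets the augmentation discourage a model from relying on track position.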

Paper Structure

This paper contains 17 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Encoding information for each track.
  • Figure 2: Track-wise frame-level attention and subsequent classification.