Persian Musical Instruments Classification Using Polyphonic Data Augmentation

Diba Hadi Esfangereh, Mohammad Hossein Sameti, Sepehr Harfi Moridani, Leili Javidpour, Mahdieh Soleymani Baghshah

TL;DR

This work tackles the scarcity of Persian polyphonic instrument data by constructing a monophonic dataset covering ten classes (seven traditional Persian instruments, violin, piano, and vocals), then applying culturally informed polyphonic augmentation that enforces a shared dastgāh and tempo to simulate authentic ensembles. Using the MERT model with a multi-label head and layer-weighted representations, the authors show that tonal coherence (dastgāh alignment) substantially improves instrument recognition, with tempo alignment providing additional gains when the two are combined. The best-performing configuration achieves a ROC-AUC of 0.795 on real Persian polyphonic recordings, while dastgāh alignment alone yields the highest accuracy and F1, indicating complementary effects of tonal and rhythmic constraints. Overall, the paper highlights the value of culturally grounded augmentation for robustness in low-resource MIR settings and offers a dataset and methods to advance cross-cultural instrument tagging and generation.

Abstract

Musical instrument classification is essential for music information retrieval (MIR) and generative music systems. However, research on non-Western traditions, particularly Persian music, remains limited. We address this gap by introducing a new dataset of isolated recordings covering seven traditional Persian instruments, two common but originally non-Persian instruments (violin and piano), and vocals. We propose a culturally informed data augmentation strategy that generates realistic polyphonic mixtures from monophonic samples. Using the MERT model (Music undERstanding with large-scale self-supervised Training) with a classification head, we evaluate our approach on out-of-distribution data obtained by manually labeling segments of traditional songs. On real-world polyphonic Persian music, the proposed method yielded the best ROC-AUC (0.795), highlighting the complementary benefits of tonal and temporal coherence. These results demonstrate the effectiveness of culturally grounded augmentation for robust Persian instrument recognition and provide a foundation for culturally inclusive MIR and diverse music generation systems.
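
The augmentation idea can be illustrated with a minimal sketch, which is not the authors' pipeline. It assumes each monophonic clip is a dictionary with hypothetical keys `audio`, `label`, `dastgah`, and `tempo`, and that clips count as compatible only when their dastgāh and tempo match exactly; a real pipeline might allow a tempo tolerance or apply time-stretching. Compatible clips from different classes are summed, peak-normalized, and paired with a multi-hot target.

```python
from collections import defaultdict

import numpy as np


def make_mixtures(clips, classes, n_sources=2, seed=0):
    """Build one polyphonic mixture per (dastgah, tempo) group.

    `clips` is a list of dicts with the hypothetical keys 'audio'
    (1-D float array), 'label', 'dastgah' and 'tempo'. Exact matching
    of dastgah and tempo is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)

    # Group clips so that every mixture shares tonal and rhythmic context.
    groups = defaultdict(list)
    for clip in clips:
        groups[(clip["dastgah"], clip["tempo"])].append(clip)

    mixtures = []
    for key, members in groups.items():
        by_label = defaultdict(list)
        for clip in members:
            by_label[clip["label"]].append(clip)
        if len(by_label) < n_sources:
            continue  # not enough distinct classes to mix in this group

        # Pick distinct classes, then one random clip per chosen class.
        labels = sorted(by_label)
        picked = rng.choice(len(labels), size=n_sources, replace=False)
        chosen_labels = [labels[i] for i in picked]
        chosen = [
            by_label[name][rng.integers(len(by_label[name]))]
            for name in chosen_labels
        ]

        # Trim to the shortest clip, sum, and peak-normalize.
        length = min(len(c["audio"]) for c in chosen)
        mix = np.sum([c["audio"][:length] for c in chosen], axis=0)
        peak = np.max(np.abs(mix))
        if peak > 0:
            mix = mix / peak

        # Multi-hot target: 1.0 for every class present in the mixture.
        target = np.zeros(len(classes), dtype=np.float32)
        for name in chosen_labels:
            target[classes.index(name)] = 1.0
        mixtures.append((mix, target, key))
    return mixtures
```

Grouping before sampling keeps the constraint explicit: dropping `tempo` from the group key would give a dastgāh-only variant, and dropping both would give unconstrained mixing, which are the kinds of configurations the abstract compares.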

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Dataset construction pipeline (see Appendix A for details).
  • Figure 2: Accuracy of different models across varying decision thresholds. Each curve represents a distinct model configuration.
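
A threshold sweep of the kind plotted in Figure 2 could be computed roughly as follows. This is a sketch under assumptions: the function name `accuracy_by_threshold` is hypothetical, and accuracy is taken here as the fraction of individual (clip, class) decisions that match the multi-hot ground truth, which may differ from the definition used in the paper.

```python
import numpy as np


def accuracy_by_threshold(probs, targets, thresholds):
    """Element-wise multi-label accuracy at each decision threshold.

    probs:   (n_clips, n_classes) predicted probabilities
    targets: (n_clips, n_classes) multi-hot ground truth
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets).astype(bool)
    results = {}
    for t in thresholds:
        preds = probs >= t  # a class is predicted present at or above t
        results[float(t)] = float(np.mean(preds == targets))
    return results


# Toy example with two clips and two classes (values are illustrative only).
scores = accuracy_by_threshold(
    probs=[[0.9, 0.2], [0.6, 0.4]],
    targets=[[1, 0], [0, 1]],
    thresholds=[0.3, 0.5, 0.7],
)
```

Running the same sweep for each model configuration and plotting accuracy against the threshold would give one curve per configuration.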