Music Genre Classification: A Comparative Analysis of Classical Machine Learning and Deep Learning Approaches

Sachin Prajuli, Abhishek Karna, OmPrakash Dhakl

Abstract

Automatic music genre classification is a long-standing challenge in Music Information Retrieval (MIR); work on non-Western music traditions remains scarce. Nepali music encompasses culturally rich and acoustically diverse genres--from the call-and-response duets of Lok Dohori to the rhythmic poetry of Deuda and the distinctive melodies of Tamang Selo--that have not been addressed by existing classification systems. In this paper, we construct a novel dataset of approximately 8,000 labeled 30-second audio clips spanning eight Nepali music genres and conduct a systematic comparison of nine classification models across two paradigms. Five classical machine learning classifiers (Logistic Regression, SVM, KNN, Random Forest, and XGBoost) are trained on 51 hand-crafted audio features extracted via Librosa, while four deep learning architectures (CNN, RNN, parallel CNN-RNN, and sequential CNN followed by RNN) operate on Mel spectrograms of dimension 640 x 128. Our experiments reveal that the sequential Convolutional Recurrent Neural Network (CRNN)--in which convolutional layers feed into an LSTM--achieves the highest accuracy of 84%, substantially outperforming both the best classical models (Logistic Regression and XGBoost, both at 71%) and all other deep architectures. We provide per-class precision, recall, F1-score, confusion matrices, and ROC analysis for every model, and offer a culturally grounded interpretation of misclassification patterns that reflects genuine overlaps in Nepal's musical traditions.
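
As an illustration of the per-class metrics mentioned above, the following minimal sketch derives precision, recall, and F1-score from paired true and predicted labels by way of confusion-matrix counts. The function name `per_class_report` and the toy labels are hypothetical, chosen for illustration only; they are not the paper's evaluation code or data.

```python
from collections import Counter


def per_class_report(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from paired label lists."""
    # confusion[(t, p)] counts samples with true label t predicted as p.
    confusion = Counter(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = confusion[(c, c)]
        # False positives: other genres predicted as c.
        fp = sum(confusion[(t, c)] for t in labels if t != c)
        # False negatives: genre c predicted as something else.
        fn = sum(confusion[(c, p)] for p in labels if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return report


# Toy labels (illustrative only, not drawn from the paper's test set).
y_true = ["Pop", "Pop", "Rock", "Rock", "Deuda", "Deuda"]
y_pred = ["Pop", "Rock", "Rock", "Rock", "Deuda", "Pop"]
report = per_class_report(y_true, y_pred, ["Pop", "Rock", "Deuda"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Overall accuracy is the share of samples on the confusion-matrix diagonal, which is why a single accuracy figure can hide weak recall on individual genres.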

Paper Structure

This paper contains 37 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Mel spectrograms of representative 30-second clips from each genre. Each spectrogram has dimensions $640 \times 128$ (time $\times$ Mel bands). Visible differences in energy distribution, harmonic structure, and temporal patterns across genres provide the discriminative cues that the deep learning models learn to exploit.
  • Figure 2: Confusion matrix for the CRNN model on the test set (800 samples). Diagonal values represent correct classifications. The most frequent confusions occur between Purbeli Bhaka and Deuda, and between Pop and Aadhunik Sangeet, pairs that share genuine cultural and acoustic overlap.
  • Figure 3: Per-class ROC curves for the CRNN model. All eight genres achieve AUC $\geq 0.92$, with Lok Dohori, Rap, and Rock reaching AUC $\geq 0.99$.
  • Figure 4: Training dynamics for the CRNN model. (a) Training and test accuracy over epochs. (b) Training and test loss over epochs. Test accuracy converges near 84% after $\sim$50 epochs.
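
Per-class ROC curves such as those in Figure 3 are usually obtained one-vs-rest: each genre in turn is treated as the positive class and all others as negatives. The sketch below computes the corresponding AUC directly from its rank interpretation, that is, the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. It assumes the model outputs one probability per genre; the helper names and the toy probabilities are hypothetical and are not the paper's outputs.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probs, y_true, labels):
    """probs[i][k] is the predicted probability of labels[k] for sample i."""
    return {
        label: binary_auc([row[k] for row in probs],
                          [t == label for t in y_true])
        for k, label in enumerate(labels)
    }


# Toy predictions (illustrative only, not the paper's results).
labels = ["Lok Dohori", "Rap", "Rock"]
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
    [0.5, 0.1, 0.4],
]
y_true = ["Lok Dohori", "Lok Dohori", "Rap", "Rap", "Rock", "Rock"]
auc = one_vs_rest_auc(probs, y_true, labels)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred clips; a sort-based variant would be preferable at larger scale.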