Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification

Mingwen Dong

TL;DR

This paper addresses automatic music genre classification by training a convolutional neural network on log-mel spectrogram segments and aggregating segment-level predictions to classify entire tracks. The approach integrates psychophysics and neurophysiology, aiming for STRF-like filters and improved discrimination through 3-second segment analysis. It achieves human-level accuracy (~70%) on 10 genres, outperforming prior methods, and demonstrates that learned features remap the input into a linearly separable representation. The work suggests practical impact for music recommendation and broader MIR tasks while supporting biological plausibility of learned spectral-temporal features.

Abstract

Music genre classification is one example of content-based analysis of music signals. Traditionally, human-engineered features have been used to automate this task, reaching 61% accuracy in 10-genre classification. This is still below the 70% accuracy that humans achieve on the same task. Here, we propose a new method that combines findings from human perception studies of music genre classification with the neurophysiology of the auditory system. The method trains a simple convolutional neural network (CNN) to classify a short segment of the music signal. The genre of a whole piece is then determined by splitting it into short segments and combining the CNN's predictions over all of them. After training, this method achieves human-level (70%) accuracy, and the filters learned by the CNN resemble the spectrotemporal receptive fields (STRFs) found in the auditory system.
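The segment-then-combine step described above can be illustrated with a minimal sketch. The abstract does not state the exact combination rule, so averaging the per-segment class probabilities is an assumption made here for illustration; the function names `split_into_segments` and `aggregate_predictions` are hypothetical, and the 3-second segment length is taken from the TL;DR.

```python
def split_into_segments(samples, sample_rate, segment_seconds=3.0):
    """Cut a sequence of samples into non-overlapping fixed-length segments.

    A trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(sample_rate * segment_seconds)
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


def aggregate_predictions(segment_probs):
    """Combine per-segment class probabilities into one track-level label.

    Assumption: the combination rule is a plain average of the segment
    probabilities; the source only says the predictions are "combined".
    Returns (index of the winning class, averaged probabilities).
    """
    n_segments = len(segment_probs)
    n_classes = len(segment_probs[0])
    mean = [sum(p[k] for p in segment_probs) / n_segments
            for k in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean


# Toy usage: three segments, two classes. In practice each row would be
# the CNN's softmax output for one 3-second segment.
toy_probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
label, mean_probs = aggregate_predictions(toy_probs)
```

Averaging lets confident segments outweigh ambiguous ones; a majority vote over per-segment labels would be an equally plausible alternative.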

Paper Structure

This paper contains 10 sections, 5 figures.

Figures (5)

  • Figure 1: Conversion of a waveform into a mel-spectrogram, with an example 3-second segment. The mel-spectrogram mimics how the human ear works, with high resolution in the low-frequency band and low resolution in the high-frequency band. Note that the mel-spectrogram shown in the figure is already log-transformed.
  • Figure 2: Confusion matrix of the CNN classification on the test set.
  • Figure 3: Filters learned by the CNN are similar to the STRFs from physiological experiments. Mel scale corresponds to frequency and relative time corresponds to latency in Figure 4.
  • Figure 4: STRF obtained from physiological experiments. From left to right are the STRFs obtained from lower to higher auditory structures. Adapted from theunissen2014neural with permission.
  • Figure 5: Comparison between the separability of the raw representation and that of the CNN's last-layer representation on the test data. The axes are the first three components obtained by projecting the data onto the directions found by linear discriminant analysis (LDA) fitted on the training data.
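The resolution property mentioned in the Figure 1 caption can be made concrete with a small sketch of the mel scale. The formula below is the commonly used 2595·log10(1 + f/700) variant; whether the paper uses this exact variant is an assumption, and the function names are hypothetical. Band edges that are equally spaced in mel turn out to be narrow in Hz at low frequencies and wide at high frequencies.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mels (common 2595*log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_bands):
    """Return n_bands + 1 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


# Toy usage: four bands between 0 Hz and 8 kHz (values chosen for illustration).
edges = mel_band_edges(0.0, 8000.0, 4)
widths = [b - a for a, b in zip(edges, edges[1:])]
# widths grows from the first band to the last: fine resolution at low
# frequencies, coarse resolution at high frequencies.
```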