The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use

Bob L. Sturm

TL;DR

The paper scrutinizes the GTZAN dataset, a dominant benchmark in music genre recognition (MGR), and uncovers substantial content-related faults that threaten evaluation validity. It develops metadata-driven analyses, top-tag insights, and a mislabeling scoring framework, demonstrating that faults affect different MGR systems unevenly and undermine cross-system comparability. Through fault-aware experiments on multiple systems and an estimate that a near-ideal classifier would reach about 94.5% accuracy, the work argues for content-aware evaluation and richer dataset metadata rather than discarding GTZAN. It concludes that GTZAN can still be valuable if used with awareness and enriched with metadata, and it provides practical guidance for future research in related music understanding tasks.

Abstract

The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.

Paper Structure

This paper contains 19 sections, 11 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Artist composition of each GTZAN category. We do not include unidentified excerpts.
  • Figure 2: Top tags of each GTZAN category. We do not include unidentified excerpts.
  • Figure 3: Annual numbers of published works in MGR with experimental components, divided into ones that use and do not use GTZAN.
  • Figure 4: Highest classification accuracies (y-axis) reported (cross-references labeled) with experimental design Classify using all GTZAN. Shapes (legend) denote particular details of the experimental procedure, e.g., "2fCV" is two-fold cross validation; "other" means randomly partitioning data into training/validation/test sets, or an unspecified experimental procedure. Five "x" denote results that have been challenged, and/or shown to be invalid. Solid gray line is our estimate of the "perfect" accuracy in Table \ref{tab:problems1}. Dashed gray line is the best accuracy of the five systems in Section \ref{sec:effects} that we evaluate using fault filtering.
  • Figure 5: Normalized accuracy (\ref{eq:accuracy}) of each system (x-axis) for each fold (left and right) of different partitioning (legend).
  • ...and 2 more figures