Automatic Detection and Analysis of Singing Mistakes for Music Pedagogy

Sumit Kumar, Suraj Jaiswal, Parampreet Singh, Vipul Arora

TL;DR

The paper tackles automatic detection of singing mistakes in Indian Art Music by introducing the M3 dataset of synchronized teacher–learner recordings with frame-level annotations for pitch and amplitude errors. It benchmarks rule-based and deep learning approaches (CNN, CRNN, TCN) under a collar-based evaluation framework, demonstrating that learning-based methods outperform baselines and that temporal models (TCN) capture error continuity effectively. A systematic analysis across data splits and cross-teacher settings reveals both generalizable patterns and teacher-specific annotation tolerances, guiding pedagogy-focused feedback. The work provides a publicly available dataset, models, and evaluation methodology that can inform practical, interpretable, and real-time feedback tools in music education.

Abstract

The advancement of machine learning in audio analysis has opened new possibilities for technology-enhanced music education. This paper introduces a framework for automatic singing mistake detection in the context of music pedagogy, supported by a newly curated dataset. The dataset comprises synchronized teacher–learner vocal recordings, with annotations marking the different types of mistakes made by learners. Using this dataset, we develop and benchmark several deep learning models for mistake detection. To compare the efficacy of mistake detection systems, we propose a new evaluation methodology. Experiments indicate that the proposed learning-based methods are superior to rule-based methods. A systematic study of errors and a cross-teacher study reveal insights into music pedagogy that can be utilised in various music applications. This work sets out new directions of research in music pedagogy. The code and dataset are publicly available.

Paper Structure (30 sections, 11 equations, 6 figures, 7 tables, 1 algorithm)

This paper contains 30 sections, 11 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Duration-wise distribution of recordings across learners associated with Teacher 1. The Learner ID (x-axis) represents an anonymized identifier assigned to each learner.
  • Figure 2: Duration-wise distribution of recordings across learners associated with Teacher 2. The Learner ID (x-axis) represents an anonymized identifier assigned to each learner.
  • Figure 3: Ground-truth mistake distribution for all learners of each teacher, where F: frequency mistakes, A: amplitude mistakes, P: pronunciation mistakes, T: timing mistakes, O: other mistakes, NM: no mistake.
  • Figure 4: Illustration of collar-based frame-wise evaluation. Left: naive frame-wise evaluation without collars. Right: with a collar of $c=1$ frame, ground-truth mistake frames are dilated by one frame on both sides (gray arrows). For each predicted mistake frame (bottom row), we assign: True Positive (TP) if it overlaps any dilated ground-truth mistake frame, False Positive (FP) if it does not overlap any, and False Negative (FN) for a ground-truth mistake frame that is not predicted as a mistake even after dilation. True Negatives (TN) may be omitted as they are not used in metric computation.
  • Figure 5: Ground truth class-distribution for all four split scenarios. Within each scenario, the bar on the left corresponds to the training set and the bar on the right corresponds to the test set. F: frequency mistakes, A: amplitude mistakes, NM: no mistake.
  • ...and 1 more figure
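
The collar-based frame-wise evaluation described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`dilate`, `collar_counts`, `precision_recall_f1`) are hypothetical, and two points are assumptions rather than statements from the paper. First, a ground-truth mistake frame is counted as a False Negative only when no predicted mistake frame lies within the collar of it, which is one reading of "not predicted as a mistake even after dilation". Second, the counts are combined into precision, recall, and F1, which fits the remark that True Negatives are not used but is not confirmed by the caption.

```python
import numpy as np


def dilate(mask, collar):
    """Dilate a 1-D boolean mask by `collar` frames on both sides."""
    mask = np.asarray(mask, dtype=bool)
    out = mask.copy()
    for shift in range(1, collar + 1):
        out[shift:] |= mask[:-shift]   # spread each True frame to the right
        out[:-shift] |= mask[shift:]   # spread each True frame to the left
    return out


def collar_counts(gt, pred, collar=1):
    """Return (TP, FP, FN) frame counts under a collar of `collar` frames.

    TP: predicted mistake frames overlapping the dilated ground truth.
    FP: predicted mistake frames outside the dilated ground truth.
    FN: ground-truth mistake frames with no prediction within the collar
        (assumed interpretation of the Figure 4 caption).
    """
    gt = np.asarray(gt, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    gt_dilated = dilate(gt, collar)
    pred_dilated = dilate(pred, collar)
    tp = int(np.sum(pred & gt_dilated))
    fp = int(np.sum(pred & ~gt_dilated))
    fn = int(np.sum(gt & ~pred_dilated))
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Combine the counts into precision, recall, and F1 (assumed metrics)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 8 frames, ground-truth mistake at frames 2-3.
gt = [0, 0, 1, 1, 0, 0, 0, 0]
pred = [0, 1, 1, 0, 0, 0, 1, 0]
strict = collar_counts(gt, pred, collar=0)   # naive frame-wise evaluation
lenient = collar_counts(gt, pred, collar=1)  # collar of one frame
```

In the toy example, the prediction at frame 1 is a False Positive under naive evaluation but becomes a True Positive with a one-frame collar, while the isolated prediction at frame 6 stays a False Positive in both cases.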