voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models

Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri, Shrikanth Narayanan

TL;DR

This work evaluates the transferability of speech foundation models to singing phonation classification, and shows layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).

Abstract

We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-specific neural nets; this work evaluates the transferability of speech foundation models to singing phonation classification. voice2mode extracts layer-wise representations from HuBERT and two wav2vec2 variants, applies global temporal pooling, and classifies the pooled embeddings with lightweight classifiers (SVM, XGBoost). Experiments on a publicly available soprano dataset (763 sustained vowel recordings, four labels) show that foundation-model features substantially outperform conventional spectral baselines (spectrogram, mel-spectrogram, MFCC). HuBERT embeddings obtained from early layers yield the best result (~95.7% accuracy with SVM), an absolute improvement of ~12-15% over the best traditional baseline. We also show layer-wise behaviour: lower layers, which retain acoustic/phonetic detail, are more effective than top layers specialized for Automatic Speech Recognition (ASR).
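The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes the per-layer hidden states have already been extracted as arrays of shape (frames, dim); random arrays stand in for real HuBERT or wav2vec2 outputs here. The abstract says only "global temporal pooling", so the use of mean pooling, the function name `pool_layers`, and the layer, frame, and dimension sizes are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def pool_layers(hidden_states):
    """Mean-pool each layer's (frames, dim) matrix over the time axis.

    hidden_states: sequence of arrays, one per layer, each shaped (frames, dim).
    Returns an array shaped (layers, dim): one fixed-length embedding per layer.
    NOTE: mean pooling is an assumption; the source only says "global temporal pooling".
    """
    return np.stack([np.asarray(h, dtype=float).mean(axis=0) for h in hidden_states])


# Synthetic stand-ins for two recordings of different duration (hypothetical sizes).
rng = np.random.default_rng(0)
num_layers, dim = 4, 8
short_clip = [rng.standard_normal((50, dim)) for _ in range(num_layers)]
long_clip = [rng.standard_normal((120, dim)) for _ in range(num_layers)]

pooled_short = pool_layers(short_clip)
pooled_long = pool_layers(long_clip)

# Both recordings map to the same shape regardless of their frame counts.
print(pooled_short.shape, pooled_long.shape)
```

Because every recording is reduced to one fixed-length vector per layer, the rows for a given layer can be stacked across recordings and passed to a lightweight classifier such as an SVM, which is what makes a layer-by-layer comparison possible.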

Paper Structure

This paper contains 11 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Schematic block diagram of the proposed voice2mode phonation mode classification system for singing voice. Raw audio signals are fed directly to the feature extractor, which is based on pre-trained self-supervised foundation models (wav2vec2 and HuBERT). The extracted features are then classified with one of two classifiers (SVM or XGB).
  • Figure 2: Layer-wise classification accuracies for features derived from the three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) using the SVM classifier.
  • Figure 3: Layer-wise classification accuracies for features derived from the three pre-trained models (wav2vec2-BASE, wav2vec2-LARGE, and HuBERT) using the XGB classifier.
  • Figure 4: Confusion matrices for the best-performing baseline feature, the spectrogram (left), and for the best-performing HuBERT layer (right), both using the SVM classifier.