Significance of Chirp MFCC as a Feature in Speech and Audio Applications

S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan

TL;DR

This work introduces chirp MFCC, a spectral feature obtained by computing MFCCs from the chirp magnitude spectrum instead of the conventional Fourier magnitude spectrum. Grounded in z-transform theory, it shows that evaluating the spectrum on a circle of radius $r$ close to the dominant pole radii improves phase and magnitude estimates, especially for decaying components. Using analytical results for single- and multi-pole models and an analysis of real speech, the authors identify an optimal radius $r_{c}$ near $a_{max}$, the largest pole radius, and report gains on speech-music classification, speaker identification, and speech command recognition with both GMM and DNN pipelines. The results indicate that chirp MFCC improves consistently on vanilla MFCC in these tasks, which suggests it is useful more broadly wherever MFCC-based features are applied to speech and audio.

Abstract

A novel feature, based on the chirp z-transform, that offers an improved representation of the underlying true spectrum is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum instead of the Fourier transform magnitude spectrum. The theoretical foundations of the proposal are discussed, along with an experimental validation using the product of likelihood Gaussians, which shows the improved class separation offered by the proposed chirp MFCC compared with vanilla MFCC. Further, real-world evaluation of the feature is performed on three diverse tasks, namely speech-music classification, speaker identification, and speech command recognition. In all three tasks, the proposed chirp MFCC is shown to offer considerable improvements.
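
The derivation described in the abstract can be illustrated with a minimal sketch. It relies on the identity that evaluating the z-transform on the circle $z = r e^{j\omega}$ is the same as taking the DFT of $x[n]\,r^{-n}$. Everything else here is an assumption made for illustration and is not taken from the paper: the function names, the Hamming window, the 512-point FFT, the 26-filter triangular Mel bank, the 13 coefficients, and the unnormalized DCT-II.

```python
import numpy as np

def chirp_magnitude_spectrum(frame, r, n_fft=512):
    """|X(z)| on the circle z = r*exp(jw), computed as the FFT of frame[n] * r**(-n)."""
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame), dtype=float)
    return np.abs(np.fft.rfft(frame * np.power(r, -n), n_fft))

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with edges equally spaced on the Mel scale."""
    hz_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_filters, len(bin_freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct2(x, n_out):
    """First n_out coefficients of the unnormalized DCT-II of x."""
    size = len(x)
    k = np.arange(n_out)[:, None]
    n = np.arange(size)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * size)) @ x

def chirp_mfcc(frame, r, sample_rate=16000, n_fft=512, n_filters=26, n_ceps=13):
    """MFCC pipeline with the Fourier magnitude spectrum replaced by the chirp one."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))  # window choice is an assumption
    spectrum = chirp_magnitude_spectrum(windowed, r, n_fft)
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ (spectrum ** 2)
    return dct2(np.log(energies + 1e-10), n_ceps)
```

With `r = 1.0` the weighting is the identity and the sketch reduces to an ordinary MFCC computation. For `r < 1` the factor `r**(-n)` grows with `n`, so long frames combined with small radii can amplify the tail of the frame considerably.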

Paper Structure

This paper contains 32 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The six cases considered for empirical analysis of the error in phase estimation in multi-pole systems. As the radii of the poles are varied for each scenario in a particular case, they move along the dotted line.
  • Figure 2: Eight complex conjugate poles of a synthesized signal. The solid line marks the unit circle, while the dotted line marks the analysis circle at radius $r_{c}=a_{max}$.
  • Figure 3: Histogram of the radius of the pole with the maximum radius, computed across 400 (1s long) utterances of the Google speech commands dataset.
  • Figure 4: Product of Gaussians showing the difference in percentage overlap offered by MFCC and chirp MFCC. The Gaussians correspond to the likelihoods of phone model M1 (/aa/) tested with examples of the same phone P1 (/aa/), and M1 tested with examples of a different phone P2 (/ih/).
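
Figures 2 and 3 refer to the radius of the pole with the maximum radius, $a_{max}$. This summary does not say how the pole radii are estimated, so the sketch below is only one plausible approach: fit an all-pole model by autocorrelation-method linear prediction and take the largest root magnitude of the prediction polynomial. The function names and the default model order of 16 are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: a[1..p] such that x[n] ~ sum_k a[k] * x[n-k]."""
    frame = np.asarray(frame, dtype=float)
    size = len(frame)
    # Autocorrelation at lags 0..order (zero lag sits at index size - 1).
    acf = np.correlate(frame, frame, mode="full")[size - 1 : size + order]
    # Toeplitz normal equations R a = [acf[1], ..., acf[order]].
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, acf[1 : order + 1])

def max_pole_radius(frame, order=16):
    """Largest pole magnitude of the all-pole model fitted to the frame."""
    a = lpc_coefficients(frame, order)
    # Poles are the roots of z^p - a[1] z^(p-1) - ... - a[p].
    poles = np.roots(np.concatenate(([1.0], -a)))
    return float(np.max(np.abs(poles)))
```

Under this assumption, the value returned for a frame could serve as the analysis radius $r_{c}=a_{max}$ shown in Figure 2. The estimate depends on the model order and on any windowing applied before the fit, neither of which is specified here.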