Spectrogram features for audio and speech analysis

Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song, Donny Soh

Abstract

Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis as well. Initially, the primary motivator for spectrogram-based representations was their ability to present sound as a two-dimensional signal in the time-frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a wide range of machine learning techniques, such as convolutional neural networks, that had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its two dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state of the art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.

Paper Structure (29 sections, 8 equations, 3 figures, 9 tables)

Figures (3)

  • Figure S1: Illustration of spectrogram creation from input audio data, as a stack of frequency vectors.
  • Figure S2: Illustrations of two-dimensional time-frequency spectrograms based on (a) stabilised auditory image, (b) Constant-Q transform, (c) Mel-scaled spectrogram, (d) stacked MFCC, (e) linear magnitude spectrogram.
  • Figure S3: High-level system diagram showing (a) spectrogram features being extracted from an input waveform as a stack of scaled transforms from windowed speech regions, then (b) features gathered from patches, pooled regions or a downsampled spectrogram image, for (c) input to a deep learning classification pipeline.
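
The abstract describes a spectrogram as a matrix defined by the resolution and span of its two dimensions and by the scaling of each element, and Figure S1 describes its construction as a stack of frequency vectors. The following is a minimal sketch of that idea, not the authors' implementation: the function name `magnitude_spectrogram`, the Hann window, the frame length of 512 samples, the hop of 256 samples and the decibel scaling are all illustrative assumptions.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=512, hop=256, log_scale=True):
    """Stack windowed FFT magnitude vectors into a (frequency x time) matrix.

    frame_len sets the frequency resolution, hop sets the time resolution,
    and log_scale selects the per-element scaling (dB versus linear).
    All defaults are illustrative assumptions, not values from the paper.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop

    columns = []
    for i in range(n_frames):
        start = i * hop
        frame = signal[start:start + frame_len] * window
        # One frequency vector per analysis frame.
        columns.append(np.abs(np.fft.rfft(frame)))

    # Rows are frequency bins, columns are time frames.
    spec = np.stack(columns, axis=1)

    if log_scale:
        # Small offset avoids log(0) for silent bins.
        spec = 20.0 * np.log10(spec + 1e-10)
    return spec


# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = magnitude_spectrogram(tone, frame_len=512, hop=256)
```

With these settings the result has 257 frequency rows (`frame_len // 2 + 1`) and 61 time columns, and the bin spacing is 16000 / 512 = 31.25 Hz, so the tone falls on bin 32. Changing `frame_len` trades frequency resolution against time resolution, which is one of the design choices the paper surveys.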