Learning Temporal Resolution in Spectrogram for Audio Classification

Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

TL;DR

The paper addresses the fact that a fixed temporal resolution is often suboptimal for spectrogram-based audio classification, and introduces DiffRes, a differentiable module that learns a variable temporal resolution by adaptively merging time frames. DiffRes combines a frame-importance estimator that scores each frame, a warp-based temporal frame merging mechanism, and a resolution encoding, all trained end-to-end with the classifier. The amount of compression is set by the reduction factor $\delta = (T-t)/T$, where $T$ is the number of input frames and $t$ the number of frames passed to the classifier. Across five tasks, DiffRes yields equivalent or higher accuracy while reducing the number of temporal frames by at least 25%, and it can improve accuracy further when fed higher-resolution inputs at similar cost. The approach offers practical computational savings, improves efficiency for variable-length audio, and suggests that differentiable temporal-resolution learning could be extended to other time-series modalities.
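To make the reduction factor and the idea of importance-weighted merging concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`reduction_factor`, `target_frames`, `merge_frames`) are hypothetical, the importance scores are supplied by hand rather than learned, and the merge is a simple overlap-weighted average standing in for the paper's warp-based mechanism.

```python
def reduction_factor(T: int, t: int) -> float:
    """delta = (T - t) / T: fraction of the T input frames that is removed."""
    return (T - t) / T


def target_frames(T: int, delta: float) -> int:
    """Number of output frames t implied by a given reduction factor."""
    return round(T * (1 - delta))


def merge_frames(frames, scores, t):
    """Toy merge of T frames into t frames (illustrative assumption only).

    Scores are rescaled to sum to t, so each output slot has capacity 1.
    Each input frame then occupies an interval of length equal to its
    rescaled score on the output axis, and every output frame is the
    overlap-weighted average of the input frames that fall into its slot.
    Low-score frames share a slot (they get merged); high-score frames
    fill most of a slot on their own (they are preserved).
    """
    total = sum(scores)
    scaled = [s * t / total for s in scores]
    dim = len(frames[0])
    out = [[0.0] * dim for _ in range(t)]
    weight = [0.0] * t
    pos = 0.0
    for frame, s in zip(frames, scaled):
        start, end = pos, pos + s
        j = int(start)
        while j < t and j < end:
            overlap = min(end, j + 1) - max(start, j)
            if overlap > 0:
                for k in range(dim):
                    out[j][k] += overlap * frame[k]
                weight[j] += overlap
            j += 1
        pos = end
    return [[v / w if w > 0 else 0.0 for v in row]
            for row, w in zip(out, weight)]


if __name__ == "__main__":
    print(reduction_factor(100, 75))   # 0.25, i.e. 25% fewer frames
    frames = [[1.0], [2.0], [3.0], [4.0]]
    # Uniform scores reduce to plain average pooling: [[1.5], [3.5]]
    print(merge_frames(frames, [1, 1, 1, 1], t=2))
    # Non-uniform scores keep the high-score frames less diluted.
    print(merge_frames(frames, [1, 1, 3, 3], t=2))
```

With uniform scores the sketch degenerates to fixed average pooling, which is the baseline a learned importance estimator is meant to improve on.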

Abstract

The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of the key attributes of the audio spectrogram is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous works generally assume the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using the fixed temporal resolution, the DiffRes-based method can achieve the equivalent or better classification accuracy with at least 25% computational cost reduction. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
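The link between hop size, temporal resolution, and computational cost can be illustrated with a short frame-count calculation. The sketch below assumes an STFT without padding, a 16 kHz sample rate, and a 10-second clip. These values are illustrative assumptions, not settings taken from the paper. The 25 ms window and the 10 ms and 40 ms hop sizes match those in the Figure 1 caption. The helper name `num_frames` is hypothetical.

```python
def num_frames(duration_s: float, sample_rate: int,
               win_ms: float, hop_ms: float) -> int:
    """Number of STFT frames for an unpadded signal.

    Uses frames = 1 + floor((n_samples - win) / hop), which holds when
    no padding is applied at the signal boundaries.
    """
    n = int(round(duration_s * sample_rate))
    win = int(round(sample_rate * win_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    if n < win:
        return 0
    return 1 + (n - win) // hop


if __name__ == "__main__":
    fine = num_frames(10, 16000, win_ms=25, hop_ms=10)
    coarse = num_frames(10, 16000, win_ms=25, hop_ms=40)
    print(fine, coarse)  # 998 250
```

Under these assumptions, going from a 40 ms hop to a 10 ms hop roughly quadruples the number of frames the classifier must process, which is why a fixed fine resolution is costly and why merging non-essential frames can recover much of that cost.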

Paper Structure

This paper contains 13 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Spectrograms of Alarm Clock and Siren sounds with $40$ ms and $10$ ms hop sizes, all computed with a $25$ ms window size. The pattern of Siren, which is relatively stable, does not change significantly with a smaller hop size (i.e., higher temporal resolution), whereas the pattern of Alarm Clock does.
  • Figure 2: Audio classification with DiffRes and mel-spectrogram. Green blocks contain learnable parameters. DiffRes is a "drop-in" module between spectrogram calculation and the downstream task.
  • Figure 3: Visualizations of the DiffRes using the mel-spectrogram. The part with the shaded background is the input features.
  • Figure 4: Audio throughput in one second. Evaluated on a 2.6 GHz Intel Core i7 CPU.
  • Figure 5: Trajectories of DiffRes learning activeness ($\rho$) on different training steps and FPS settings.
  • ...and 2 more figures