From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

TL;DR

The paper tackles the high computational cost of training Audio Spectrogram Transformers (AST) by introducing a coarse-to-fine, multi-phase training paradigm that progressively increases temporal resolution. It employs three temporal compression methods (Frame-Shift, pooling, and flexible patchification) and uses curriculum learning with careful weight transfer between phases, interpolating the positional embeddings whenever the resolution changes. Across AudioSet, VGGSound, VoxCeleb, and Kinetics-Sounds, the approach saves 18–58% of FLOPs/training time with on-par or improved accuracy, and it generalizes to HTS-AT and SSAST. This work offers a practical pathway to scalable AST training and motivates future work on learnable phase schedulers to further optimize training efficiency.
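The weight-transfer step mentioned above can be illustrated with a minimal sketch. It assumes the positional embeddings are stored as a frequency-by-time grid and are resampled with plain linear interpolation along the time axis only; the summary does not state which interpolation scheme the authors use, and the function name `interpolate_time_axis` is hypothetical.

```python
def interpolate_time_axis(pos_embed, new_t):
    """Linearly resample a positional-embedding grid along the time axis.

    pos_embed: nested list indexed [f][t][d] (frequency, time, channel).
    new_t:     number of time positions needed in the next training phase.
    Assumption: linear interpolation with aligned endpoints; the paper's
    actual scheme is not given in this summary.
    """
    old_t = len(pos_embed[0])
    resized = []
    for row in pos_embed:
        new_row = []
        for j in range(new_t):
            # Map target index j onto the source time axis.
            pos = j * (old_t - 1) / (new_t - 1) if new_t > 1 else 0.0
            lo = int(pos)
            hi = min(lo + 1, old_t - 1)
            w = pos - lo
            # Blend the two neighbouring source embeddings channel by channel.
            new_row.append([(1 - w) * a + w * b for a, b in zip(row[lo], row[hi])])
        resized.append(new_row)
    return resized


# One frequency row, two time positions, one channel, stretched to three positions.
coarse = [[[0.0], [2.0]]]
fine = interpolate_time_axis(coarse, 3)
```

In this sketch the frequency axis is left untouched because only the temporal resolution changes between phases; all other weights would be copied over unchanged.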

Abstract

Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource- and time-intensive. Furthermore, the complexity of transformers heavily depends on the input audio spectrogram size. In this work, we aim to optimize AST training by linking it to the resolution along the time axis. We introduce multi-phase training of audio spectrogram transformers by connecting the seminal idea of coarse-to-fine with transformer models. To achieve this, we propose a set of methods for temporal compression. By employing one of these methods, the transformer model learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned with high-resolution data later, following a curriculum learning strategy. Experimental results demonstrate that the proposed training mechanism for AST leads to improved (or on-par) performance with faster convergence, i.e. requiring fewer computational resources and less time. This approach also generalizes to other AST-based methods regardless of their learning paradigms.
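As a rough illustration of what temporal compression means here, the sketch below average-pools a spectrogram over non-overlapping windows along the time axis. It is an assumption-laden example, not the authors' implementation: the abstract does not say where in the pipeline pooling is applied or which pooling operator is used, and `pool_time` is a hypothetical helper.

```python
def pool_time(spec, c):
    """Average-pool a spectrogram along time by a compression factor c.

    spec: nested list indexed [f][t] (frequency bins, time frames).
    c:    compression factor; each output frame averages c input frames.
    Assumption: non-overlapping average pooling on the input spectrogram.
    """
    if c < 1:
        raise ValueError("compression factor must be >= 1")
    pooled = []
    for row in spec:
        # Drop trailing frames that do not fill a complete window.
        usable = len(row) - len(row) % c
        pooled.append([sum(row[i:i + c]) / c for i in range(0, usable, c)])
    return pooled


# Two frequency bins, four frames, compressed 2x along time.
spec = [[1.0, 3.0, 5.0, 7.0],
        [0.0, 0.0, 2.0, 2.0]]
coarse = pool_time(spec, 2)
```

A multi-phase schedule would then train on inputs compressed with a larger `c` first and finish with `c = 1`, i.e. the full-resolution spectrogram.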

Paper Structure (13 sections, 3 equations, 1 figure, 6 tables)

Figures (1)

  • Figure 1: Illustration of the initial-phase and final-phase pipelines in our proposed training method. Fshift, Pool, and Patch are the compression methods described in the paper's section on compression methods. In the initial training phases, only one of them is employed, yielding $f \times \frac{t}{C}$ tokens. Each method's unique contribution relative to the original pipeline is color-highlighted. The given numbers reflect AST's original training settings.
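The token count in the caption can be made concrete with a short sketch. The grid sizes below are hypothetical placeholders, not AST's actual settings (those appear only in the figure itself), and the cost estimate relies solely on the general fact that self-attention compares every token with every other token.

```python
def token_count(f, t, c):
    """Number of tokens f * (t / C) for a compression factor C.

    f: patches along the frequency axis, t: patches along the time axis
    at full resolution, c: temporal compression factor.
    """
    if t % c != 0:
        raise ValueError("t must be divisible by the compression factor")
    return f * (t // c)


def attention_pair_ratio(f, t, c):
    """Token-pair count of self-attention relative to the uncompressed input."""
    return (token_count(f, t, c) / token_count(f, t, 1)) ** 2


# Hypothetical grid: 12 frequency patches by 100 time patches.
full = token_count(12, 100, 1)
coarse = token_count(12, 100, 2)
ratio = attention_pair_ratio(12, 100, 2)
```

Because the attention term scales quadratically with the number of tokens, halving the temporal resolution reduces that term to a quarter, while the per-token layers scale only linearly.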