FAST: Fast Audio Spectrogram Transformer

Anugunj Naman, Gaibo Zhang

TL;DR

The paper tackles the challenge of real-time, robust audio classification by introducing FAST, a lightweight CNN–transformer hybrid that leverages Lipschitz-continuous attention components to stabilize training. It combines MobileNetV2-inspired convolutional feature extraction with patch-wise transformers, augmented by CenterNorm, Scaled Cosine Similarity Attention, and Weighted Residual Shortcuts to enforce a bounded Lipschitz constant. Empirically, FAST achieves competitive or state-of-the-art performance on ADIMA and AudioSet while using up to 150x fewer parameters, and benefits from improved training stability and faster convergence due to the Lipschitz components. The work highlights strong potential for on-device audio tasks and suggests future directions in transfer learning from vision models and few-shot multilingual adaptation.

Abstract

In audio classification, developing efficient and robust models is critical for real-time applications. Inspired by the design principles of MobileViT, we present FAST (Fast Audio Spectrogram Transformer), a new architecture that combines convolutional neural networks (CNNs) and transformers to capitalize on the strengths of both. FAST integrates the efficient local feature extraction of CNNs with the global context modeling of transformers, resulting in a model that is powerful yet lightweight and well suited to real-time or mobile use. Additionally, we incorporate Lipschitz-continuous attention mechanisms to improve training stability and accelerate convergence. We evaluate FAST on the ADIMA dataset, a multilingual corpus for real-time profanity and abuse detection, as well as on the more traditional AudioSet. Our results show that FAST achieves state-of-the-art performance on both the ADIMA and AudioSet classification tasks, in some cases surpassing existing benchmarks while using up to 150x fewer parameters.
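
The three Lipschitz components named in the TL;DR (CenterNorm, Scaled Cosine Similarity Attention, Weighted Residual Shortcuts) can be illustrated with a minimal NumPy sketch. This summary does not reproduce the paper's equations, so the formulas below are assumptions based on one common formulation of these components, not the authors' implementation. All function names, parameter names (`tau`, `nu`, `alpha`, `eps`), and default values are hypothetical.

```python
import numpy as np

def center_norm(x, gamma, beta):
    # Assumed form: subtract the per-token mean over the feature axis and
    # rescale by D/(D-1). Unlike LayerNorm there is no division by the
    # standard deviation, which is what makes a Lipschitz bound possible.
    d = x.shape[-1]
    centered = x - x.mean(axis=-1, keepdims=True)
    return (d / (d - 1)) * gamma * centered + beta

def l2_normalize(x, eps=1e-6):
    # Row-wise L2 normalization; eps keeps the denominator away from zero.
    return x / np.sqrt((x ** 2).sum(axis=-1, keepdims=True) + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_cosine_attention(x, w_q, w_k, w_v, tau=1.0, nu=1.0, eps=1e-6):
    # Assumed form: queries, keys, and values are row-normalized, so the
    # attention logits are cosine similarities bounded in [-tau, tau].
    q = l2_normalize(x @ w_q, eps)
    k = l2_normalize(x @ w_k, eps)
    v = l2_normalize(x @ w_v, eps)
    weights = softmax(tau * (q @ k.T))
    return nu * (weights @ v)

def weighted_residual(x, fx, alpha):
    # Assumed form: x + alpha * f(x), where alpha is a learnable scale
    # (typically initialized small) on the residual branch.
    return x + alpha * fx

# Usage on a toy sequence of 4 tokens with 8 features each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

normed = center_norm(tokens, gamma=np.ones(8), beta=np.zeros(8))
attended = scaled_cosine_attention(normed, w_q, w_k, w_v, tau=2.0, nu=1.5)
block_out = weighted_residual(tokens, attended, alpha=0.1)
print(block_out.shape)
```

Under these assumptions, each attention output row is a convex combination of (near) unit-norm value rows scaled by `nu`, so its norm cannot exceed `nu` regardless of the input magnitude. That boundedness is the intuition behind the stability claim; the paper's exact constants and initialization may differ.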

Paper Structure

This paper contains 17 sections, 7 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Architecture of the FAST model, highlighting its combination of CNNs and transformers with Lipschitz continuous attention mechanisms. The upper section illustrates the integration of these components, while the lower section presents the full architecture, including MobileNetV2 blocks and Lipschitz-modified blocks for enhanced efficiency and stability.
  • Figure 2: Comparison of training stability and efficiency in FAST with and without Lipschitz continuity components on the ADIMA Hindi language set. Loss is measured with binary cross-entropy (BCE).