ASM: Audio Spectrogram Mixer

Qingfeng Ji, Jicun Zhang, Yuxin Wang

TL;DR

The paper introduces the Audio Spectrogram Mixer (ASM), a lighter Mixer-based alternative to the Audio Spectrogram Transformer (AST) for audio classification. By splitting spectrograms into 16x16 patch tokens, processing them with a 12-layer MLP-Mixer that supports multiple activation options, and employing an RGB-to-grayscale-style input projection, ASM achieves competitive or superior accuracy with fewer parameters and lower training and inference costs. Across the Speech Commands, UrbanSound8K, and CASIA datasets, ASM matches or outperforms AST, especially when the Encoder is replaced with the Mixer and the optimized input projection is used; among the activation functions, the adapted Acon-C yields the strongest gains. The work demonstrates the potential of Mixer architectures in audio domains and outlines concrete paths for pretraining, self-supervised learning, and architectural refinements to further boost performance and efficiency.
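
To make the 16x16 patching step concrete, here is a minimal sketch in plain Python. The function name `patchify`, the 128x64 toy spectrogram, and the `stride` parameter are assumptions made for illustration; the summary only states that patches are 16x16, so whether the model uses overlapping patches is not confirmed here.

```python
def patchify(spec, patch=16, stride=16):
    """Split a 2-D spectrogram (list of rows) into flattened patch tokens.

    spec   : list of n_freq rows, each a list of n_time values
    patch  : side length of each square patch (16 per the summary)
    stride : step between patch origins (assumed; 16 means no overlap)
    """
    n_freq, n_time = len(spec), len(spec[0])
    tokens = []
    for f0 in range(0, n_freq - patch + 1, stride):
        for t0 in range(0, n_time - patch + 1, stride):
            # Flatten the patch row by row into one token vector.
            tokens.append([spec[f][t]
                           for f in range(f0, f0 + patch)
                           for t in range(t0, t0 + patch)])
    return tokens


# Toy spectrogram: 128 frequency bins x 64 time frames (hypothetical sizes).
spec = [[float(f * 64 + t) for t in range(64)] for f in range(128)]
tokens = patchify(spec)
overlapping = patchify(spec, stride=10)
```

With these toy sizes, non-overlapping patching gives (128/16) x (64/16) = 32 tokens of 256 values each. Each token would then be linearly projected before entering the Mixer layers, which is not shown here.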

Abstract

Transformer architectures have recently delivered outstanding results in deep learning, significantly increasing model accuracy across a variety of domains. Because of their complicated network topology and high inference cost, researchers have started to question whether such a sophisticated structure is actually necessary and whether equally strong results can be reached at lower inference cost. To demonstrate the Mixer's efficacy, this paper applies a more condensed version of the Mixer to audio classification and conducts comparative experiments against the Transformer-based Audio Spectrogram Transformer (AST) model on three datasets: Speech Commands, UrbanSound8K, and the CASIA Chinese Sentiment Corpus. In addition, this paper compares several activation functions in the Mixer, namely GeLU, Mish, Swish, and Acon-C. Some flaws of the AST model are also highlighted, and the proposed model is improved accordingly. In conclusion, this study proposes the Audio Spectrogram Mixer, the first Mixer-based model for audio classification, and examines future directions for improving it.
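
For reference, the four activation functions named in the abstract can be sketched in scalar form using their commonly published definitions. This is an illustrative sketch only: in Acon-C the values `p1`, `p2`, and `beta` are learnable parameters, and the fixed defaults below are assumptions. The "adapted" Acon-C variant mentioned in the TL;DR is not specified in this section and is not reproduced here.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x).
    return x * sigmoid(beta * x)


def mish(x):
    # Mish: x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))


def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    # ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    # Default values are illustrative; in practice these are learned.
    d = (p1 - p2) * x
    return d * sigmoid(beta * d) + p2 * x
```

With `p1=1`, `p2=0`, and `beta=1`, `acon_c` reduces to Swish, and with `p1 == p2` it becomes linear; Acon-C can therefore be read as a learnable generalization of Swish.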

Paper Structure (20 sections, 7 figures, 12 tables)