EAViT: External Attention Vision Transformer for Audio Classification

Aquib Iqbal; Abid Hasan Zim; Md Asaduzzaman Tonmoy; Limengnan Zhou; Asad Malik; Minoru Kuribayashi

TL;DR

This work addresses music-genre classification in large audio collections by introducing the External Attention Vision Transformer (EAViT), which integrates Multi-head External Attention (MEA) into the ViT encoder to capture long-range dependencies and correlations across samples. The model takes mel-spectrogram images computed from 3-second segments of the 30-second GTZAN clips and uses two learnable memory units, $M_k$ and $M_v$, to implement external attention. EAViT reports an overall accuracy of 93.99% on GTZAN, which the authors describe as state of the art, outperforming ViT and several other baselines; the paper also reports per-class performance and training and validation curves. The results suggest that attention across samples can improve audio classification and may carry over to other audio tasks and datasets.
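As a rough illustration of how the memory units $M_k$ and $M_v$ enter the computation, the following is a minimal single-head sketch in NumPy. It assumes the double normalization commonly used with external attention (softmax over tokens, then L1 normalization over memory slots); this summary does not state the normalization, the number of heads, or any sizes, so `N`, `d`, `S`, and the function name `external_attention` are hypothetical. In the actual model the memory units are learned, whereas here they are random.

```python
import numpy as np

def external_attention(F, M_k, M_v, eps=1e-9):
    """Single-head external attention sketch (not the authors' code).

    F:   (N, d) token features for one input
    M_k: (S, d) external key memory, shared across all inputs
    M_v: (S, d) external value memory, shared across all inputs
    """
    scores = F @ M_k.T                                   # (N, S) token-to-memory similarity
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=0, keepdims=True)        # assumed: softmax over tokens
    attn = attn / (eps + attn.sum(axis=1, keepdims=True))  # assumed: L1 norm over memory slots
    return attn @ M_v                                    # (N, d) refined features

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, d, S = 16, 32, 8
F = rng.standard_normal((N, d))
M_k = rng.standard_normal((S, d))
M_v = rng.standard_normal((S, d))
out = external_attention(F, M_k, M_v)
```

Because `M_k` and `M_v` do not depend on the input, every sample attends to the same memory, which is how external attention can pick up correlations between samples; the cost is also linear in the number of tokens `N` rather than quadratic.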

Abstract

This paper presents the External Attention Vision Transformer (EAViT) model, a novel approach designed to enhance audio classification accuracy. As digital audio resources proliferate, the demand for precise and efficient audio classification systems has intensified, driven by the need for improved recommendation systems and user personalization in various applications, including music streaming platforms and environmental sound recognition. Accurate audio classification is crucial for organizing vast audio libraries into coherent categories, enabling users to find and interact with their preferred audio content more effectively. In this study, we utilize the GTZAN dataset, which comprises 1,000 music excerpts spanning ten diverse genres. Each 30-second audio clip is segmented into 3-second excerpts to enhance dataset robustness and mitigate overfitting risks, allowing for more granular feature analysis. The EAViT model integrates multi-head external attention (MEA) mechanisms into the Vision Transformer (ViT) framework, effectively capturing long-range dependencies and potential correlations between samples. This external attention (EA) mechanism employs learnable memory units that enhance the network's capacity to process complex audio features efficiently. The study demonstrates that EAViT achieves a remarkable overall accuracy of 93.99%, surpassing state-of-the-art models.
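The segmentation step described in the abstract can be sketched as follows. This is a minimal example that assumes non-overlapping 3-second windows and a 22,050 Hz sample rate; neither detail is stated in this summary, and the helper name `split_clip` is hypothetical. A silent placeholder array stands in for a real GTZAN clip.

```python
import numpy as np

def split_clip(waveform, sample_rate, segment_seconds=3.0):
    """Split a 1-D waveform into equal, non-overlapping segments.

    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    n_segments = len(waveform) // seg_len
    return waveform[: n_segments * seg_len].reshape(n_segments, seg_len)

sample_rate = 22050                                   # assumed sample rate
clip = np.zeros(30 * sample_rate, dtype=np.float32)   # placeholder 30-second clip
segments = split_clip(clip, sample_rate)              # one row per 3-second excerpt
```

Under these assumptions each 30-second clip yields ten excerpts, so the 1,000 GTZAN clips give up to 10,000 training examples, each of which would then be converted to a mel spectrogram.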

Paper Structure

This paper contains 11 sections, 11 equations, 6 figures, and 2 tables.

Introduction
Methodology
Data
Multi-head external-attention (MEA)
Proposed Method
External Attention Vision Transformer (EAViT)
Experiments And Results
Evaluation Metrics
Environment
Analysis
Conclusion

Figures (6)

Figure 1: Raw waveform representation of an audio sample.
Figure 2: Mel spectrogram of an audio sample.
Figure 3: Multi-head external-attention (MEA).
Figure 4: Model Overview: EAViT.
Figure 5: Proposed model: (a) Training and validation accuracy over epochs. (b) Training and validation loss over epochs.
...and 1 more figure
