EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Wenxi Chen; Yuzhe Liang; Ziyang Ma; Zhisheng Zheng; Xie Chen

TL;DR

The paper tackles the high computational cost of self-supervised audio pre-training by introducing Efficient Audio Transformer (EAT), which uses a novel Utterance-Frame Objective (UFO) to fuse global and local audio representations. It combines bootstrap self-supervision with an 80% inverse block masking strategy and a lightweight CNN decoder to achieve fast pre-training without sacrificing performance. EAT achieves state-of-the-art results on AudioSet (AS-2M, AS-20K) and ESC-50, and competitive performance on SPC-2, with a pre-training speedup of up to ~15x over prior audio SSL models. The work also demonstrates the benefits of CLS-token-based utterance predictions, and ablation analyses show the effectiveness of UFO, block masking, and balancing the utterance-level and frame-level losses for robust audio representation learning.

Abstract

Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.
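To make the idea of inverse block masking concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `inverse_block_mask`, its parameters, and the sampling loop are assumptions made for illustration. The defaults (a 2 x 2 block and an 80% masking ratio) follow the values given in the Figure 2 caption. The key point is that blocks are sampled to be kept visible, and every patch outside those blocks is masked.

```python
import random


def inverse_block_mask(height, width, block=2, mask_ratio=0.8, seed=None):
    """Return a height x width grid of booleans, where True means "masked".

    Illustrative sketch only: square blocks are sampled to be KEPT, and
    everything outside the kept blocks stays masked. Assumes
    block <= height and block <= width.
    """
    rng = random.Random(seed)
    total = height * width
    # Number of patches that must remain visible to the student.
    keep_target = total - round(total * mask_ratio)

    masked = [[True] * width for _ in range(height)]
    kept = 0
    while kept < keep_target:
        # Sample the top-left corner of a block that fits inside the grid.
        top = rng.randrange(height - block + 1)
        left = rng.randrange(width - block + 1)
        for r in range(top, top + block):
            for c in range(left, left + block):
                # Unmask patches until the target is met; the final block
                # may therefore be only partially kept.
                if masked[r][c] and kept < keep_target:
                    masked[r][c] = False
                    kept += 1
    return masked


if __name__ == "__main__":
    grid = inverse_block_mask(8, 16, seed=0)
    for row in grid:
        # '#' marks a masked patch, '.' a visible one.
        print("".join("#" if cell else "." for cell in row))
```

Capping the number of kept patches makes the masking ratio exact, at the cost of an occasional partial block; a real implementation may handle this differently.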

Paper Structure (26 sections, 4 equations, 4 figures, 4 tables)

Introduction
Related Work
Bootstrap Method
Self-supervised Audio Pre-training
Method
Model Architecture
Patch Embedding with Positional Encoding.
Utterance-Frame Objective
Masking Strategies in Pre-training
Pre-training Details
Fine-tuning Details
Experiments
Experimental Setups
AudioSet (AS-2M, AS-20K).
Environmental Sound Classification (ESC-50).
...and 11 more sections

Figures (4)

Figure 1: Architecture of EAT in Audio Self-supervised Pre-training. EAT first transforms the audio spectrogram into patch embeddings with a CNN encoder. The embeddings are then fed separately into the student model, via the inverse block multi-mask method, and directly into the teacher model, which uses the same network. Subsequently, the generated features, merged with the masked parts, are decoded using a lightweight CNN decoder. The teacher model takes the average output of all Transformer layers as the target value. The utterance-level loss is a regression on the mean-pooled target values across the patch dimension, while the frame-level loss is a regression on the target values at masked positions. The teacher model is updated through the EMA method, based on the learnable parameters of the student model. Here, "sg" denotes stop-gradient.
Figure 2: Inverse Block Masking on Audio Patches. The block size is set to $2 \times 2$ with a masking ratio of $80\%$ in the right subfigure.
Figure 3: Comparison with $\text{BEATs}_{iter3}$ and Audio-MAE by pre-training epoch over EAT's 10-epoch pre-training. All models are uniformly fine-tuned on AS-20K and tested on the evaluation set.
Figure 4: Comparison on Utterance-level Loss Weight $\lambda$ in Pre-training and Prediction Methods in Fine-tuning. During fine-tuning, we compare making the final prediction from the CLS token against mean pooling over all frames.
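
The objective described in the Figure 1 caption and the weight $\lambda$ from Figure 4 can be sketched in a few lines of plain Python. This is a hedged illustration rather than the paper's code: the function names, the use of mean squared error for the "regression", the additive combination `frame + lam * utt`, and the EMA decay value are all assumptions. The utterance-level prediction is simply passed in as a vector; the TL;DR suggests it comes from a CLS token, but that is not modelled here.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ufo_loss(frame_preds, utt_pred, targets, masked_idx, lam=1.0):
    """Sketch of an utterance-frame objective (assumed form, MSE regression).

    frame_preds: dict mapping a masked patch index to the student's vector
    utt_pred:    the student's single utterance-level vector
    targets:     list of teacher target vectors, one per patch
    masked_idx:  indices of the patches that were masked for the student
    lam:         weight on the utterance-level term (lambda in Figure 4)
    """
    # Frame-level term: regress the teacher targets at masked positions only.
    frame = sum(mse(frame_preds[i], targets[i]) for i in masked_idx)
    frame /= len(masked_idx)

    # Utterance-level term: regress the mean of the targets over all patches.
    dim = len(targets[0])
    pooled = [sum(t[d] for t in targets) / len(targets) for d in range(dim)]
    utt = mse(utt_pred, pooled)

    return frame + lam * utt


def ema_update(teacher, student, decay=0.999):
    """Move teacher parameters toward the student (decay is a placeholder)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    targets = [[0.0, 0.0], [2.0, 2.0]]
    loss = ufo_loss({1: [1.0, 3.0]}, [1.0, 2.0], targets, [1], lam=2.0)
    print(loss)
```

In a real training loop the targets would be computed without gradients (the "sg" in Figure 1), and `ema_update` would run on the teacher's parameters after each student step.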
