Scaling up masked audio encoder learning for general audio classification
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang
TL;DR
The paper tackles the problem of general audio classification across speech, music, and environmental sounds by scaling self-supervised learning with masked autoencoders. It introduces Dasheng, a 1.2B-parameter MAE-style encoder trained on 272,356 hours of diverse audio, employing 75% masking on Mel-spectrogram tokens and 25 Hz frame-level embeddings to learn rich representations. On the HEAR benchmark, Dasheng achieves strong results across multiple tasks, particularly in speech-related areas, and demonstrates notable cross-domain capabilities. The study shows that jointly scaling model size and data leads to meaningful performance gains, and that Dasheng embeddings are already powerful enough for competitive k-NN and linear evaluations without extensive fine-tuning. This work highlights MAE-based SSL as a practical path to general, scalable audio representations with broad applicability.
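The 75% masking step mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `random_mask`, the token dimension of 64, and the assumption that each token corresponds to one 40 ms frame (25 Hz) are all chosen here for illustration only.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Randomly hide a fraction of tokens, MAE-style (illustrative sketch).

    tokens: array of shape (num_tokens, dim).
    Returns the visible tokens, their indices, and a boolean mask
    where True marks a masked (hidden) token.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens.shape[0]
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    # A random permutation picks which tokens stay visible.
    perm = rng.permutation(num_tokens)
    keep_idx = np.sort(perm[:num_keep])
    mask = np.ones(num_tokens, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

# Hypothetical input: 4 s of audio at an assumed 25 tokens per second.
tokens = np.random.default_rng(0).standard_normal((100, 64))
visible, keep_idx, mask = random_mask(tokens, rng=np.random.default_rng(1))
print(visible.shape, int(mask.sum()))
```

In a masked autoencoder, only the visible tokens are passed through the encoder, which is what makes a high masking ratio cheap to train with; the decoder then reconstructs the hidden positions.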
Abstract
Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised learning (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at https://github.com/richermans/dasheng/.
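To make the nearest-neighbor evaluation idea concrete, the following is a minimal sketch of k-NN classification over frozen clip-level embeddings. It is an assumption-laden illustration, not the paper's evaluation protocol: the helper `knn_predict`, the use of cosine similarity, the choice of k = 3, and the toy 2-dimensional embeddings are all hypothetical.

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Majority-vote k-NN on cosine similarity (illustrative sketch)."""
    # L2-normalise so that the dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = query @ train.T
    # Indices of the k most similar training embeddings per query.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_labels = train_labels[nn_idx]
    num_classes = int(train_labels.max()) + 1
    votes = np.stack([np.bincount(row, minlength=num_classes) for row in nn_labels])
    return votes.argmax(axis=1)

# Toy stand-ins for clip-level embeddings (e.g. frame embeddings averaged over time).
train_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                      [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
query_emb = np.array([[1.0, 0.05], [0.05, 1.0]])
preds = knn_predict(train_emb, train_labels, query_emb, k=3)
print(preds)
```

Because no parameters are trained in this setup, accuracy depends only on how well the frozen embeddings already separate the classes, which is why such an evaluation is a direct probe of representation quality.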
