Scaling up masked audio encoder learning for general audio classification

Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

TL;DR

The paper addresses general audio classification across speech, music, and environmental sounds by scaling self-supervised learning with masked autoencoders (MAE). It introduces Dasheng, a 1.2B-parameter MAE-style encoder trained on 272,356 hours of diverse audio, which masks 75% of the Mel-spectrogram tokens during training and produces frame-level embeddings at 25 Hz. On the HEAR benchmark, Dasheng outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification. The study shows that jointly scaling model size and data improves performance, and that Dasheng embeddings are strong enough for competitive k-NN and linear evaluation without extensive fine-tuning. This work highlights MAE-based SSL as a practical path to general, scalable audio representations.

Abstract

Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available at https://github.com/richermans/dasheng/.

Paper Structure (14 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Graph showcasing Dasheng's capability on the HEAR benchmark, compared to expert models CED-Base (Environment, Music) and Whisper-base (Speech), as well as baselines AudioMAE and Wav2Vec2. Best viewed in color.
  • Figure 2: The Dasheng training framework. Four consecutive Mel-spectrogram frames are "chunkified" into a single token. Following a linear transformation and the addition of a positional embedding, 75% of these chunked representations are discarded. The resulting tokens $\mathbf{V}$ are then fed into Dasheng, which extracts high-dimensional embeddings. During training, these embeddings are further fed into a small decoder responsible for predicting those chunks that were initially excluded.
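
The chunk-and-mask step described in the Figure 2 caption can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names `chunk_frames` and `random_mask`, the 64 Mel bins, and the 100 frames-per-second input rate are assumptions made for the example (the input rate is simply what four frames per token and 25 Hz embeddings would imply). The linear transformation, positional embedding, encoder, and decoder are omitted.

```python
import random

def chunk_frames(frames, chunk_size=4):
    """Concatenate every `chunk_size` consecutive Mel frames into one token.

    Trailing frames that do not fill a whole chunk are dropped
    (an assumption made for this sketch).
    """
    n_tokens = len(frames) // chunk_size
    tokens = []
    for i in range(n_tokens):
        chunk = frames[i * chunk_size:(i + 1) * chunk_size]
        # Flatten chunk_size frames of n_mels bins into a single vector.
        tokens.append([value for frame in chunk for value in frame])
    return tokens

def random_mask(n_tokens, mask_ratio=0.75, rng=None):
    """Randomly split token indices into (kept, masked), both sorted."""
    rng = rng or random.Random()
    n_masked = int(n_tokens * mask_ratio)
    n_keep = n_tokens - n_masked
    order = list(range(n_tokens))
    rng.shuffle(order)
    return sorted(order[:n_keep]), sorted(order[n_keep:])

# Toy input: one second of audio at an ASSUMED 100 Mel frames per second
# with an ASSUMED 64 Mel bins per frame (neither value is from the source).
n_frames, n_mels = 100, 64
frames = [[0.0] * n_mels for _ in range(n_frames)]

tokens = chunk_frames(frames)                       # 4 frames -> 1 token
kept, masked = random_mask(len(tokens), rng=random.Random(0))
visible = [tokens[i] for i in kept]                 # what the encoder would see

print(len(tokens), len(visible), len(masked))
```

With these toy values, one second of audio yields 25 tokens, of which 7 stay visible to the encoder and 18 are left for the decoder to predict. Because the encoder only processes the visible quarter of the sequence, most of the training cost scales with the kept tokens rather than the full input.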