EnCodecMAE: Leveraging neural codecs for universal audio representation learning

Leonardo Pepino, Pablo Riera, Luciana Ferrer

TL;DR

EnCodecMAE is a universal audio representation model that masks frame-level representations of the audio signal and trains a masked autoencoder (MAE) to predict, for the masked segments, the discrete units produced by a neural audio codec (EnCodec). It combines the MAE architecture with EnCodec residual vector quantization (RVQ) targets and a self-training stage based on k-means clustering. The model operates on individual frames rather than patches and is pretrained on a mixture of audio datasets. Across speech, music, and environmental sound tasks, the best model outperforms several state-of-the-art audio representation models in global performance. On automatic speech recognition (ASR), evaluated under a standardized protocol, it obtains decent results.

Abstract

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on self-supervised learning for NLP, like BERT, or computer vision, like masked autoencoders (MAE), are often adapted to the audio domain. In this work, we propose masking representations of the audio signal, and training a MAE to reconstruct the masked segments. The reconstruction is done by predicting the discrete units generated by EnCodec, a neural audio codec, from the unmasked inputs. We evaluate this approach, which we call EnCodecMAE, on a wide range of tasks involving speech, music and environmental sounds. Our best model outperforms various state-of-the-art audio representation models in terms of global performance. Additionally, we evaluate the resulting representations in the challenging task of automatic speech recognition (ASR), obtaining decent results and paving the way for a universal audio representation.

Paper Structure (11 sections, 1 equation, 1 figure, 2 tables)


Figures (1)

  • Figure 1: EnCodecMAE architecture. Features $X_f$ are extracted from an audio signal $X_a$, projected to the model dimensionality $D$, and positional embeddings are added. A percentage of the frames are masked and discarded, and the resulting sequence $X_v$ is processed by the MAE encoder. Before feeding the encoder output $X_e$ to the decoder, mask tokens are inserted in the positions that were dropped. The loss is finally computed between the posteriors $\hat{Y}$ generated by the MAE's decoder and the discrete targets $Y$ produced by EnCodec.
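
The data flow in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoder and decoder are stand-in linear layers, the targets come from a single codebook rather than EnCodec's multiple RVQ codebooks, the loss is averaged over all frames, and every size, weight, and helper name (`mask_and_drop`, `insert_mask_tokens`, `cross_entropy`) is hypothetical.

```python
import numpy as np

def mask_and_drop(x, mask_ratio, rng):
    """Randomly pick frames to mask; return visible frames and both index sets."""
    num_frames = x.shape[0]
    num_masked = int(round(mask_ratio * num_frames))
    perm = rng.permutation(num_frames)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return x[visible_idx], visible_idx, masked_idx

def insert_mask_tokens(x_e, visible_idx, num_frames, mask_token):
    """Rebuild a full-length sequence with mask_token at the dropped positions."""
    full = np.tile(mask_token, (num_frames, 1))
    full[visible_idx] = x_e
    return full

def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and integer targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
# Frames, feature dim, model dim D, codebook size: all hypothetical values.
T, F, D, K = 20, 8, 16, 32

X_f = rng.normal(size=(T, F))          # stand-in for features extracted from X_a
Y = rng.integers(0, K, size=T)         # stand-in for EnCodec discrete targets
W_in = rng.normal(size=(F, D)) * 0.1   # projection to model dimensionality D
pos = rng.normal(size=(T, D)) * 0.1    # positional embeddings
W_enc = rng.normal(size=(D, D)) * 0.1  # stub encoder (assumption: one linear layer)
W_dec = rng.normal(size=(D, K)) * 0.1  # stub decoder (assumption: one linear layer)
mask_token = np.zeros(D)               # would be a learned vector in a real model

X = X_f @ W_in + pos                                   # project, add positions
X_v, visible_idx, masked_idx = mask_and_drop(X, 0.5, rng)  # mask and discard
X_e = np.tanh(X_v @ W_enc)                             # encoder sees visible frames only
dec_in = insert_mask_tokens(X_e, visible_idx, T, mask_token)
logits = dec_in @ W_dec                                # unnormalized posteriors for Y_hat
loss = cross_entropy(logits, Y)
print(X_v.shape, dec_in.shape, logits.shape)
```

Because the masked frames are dropped before the encoder, the encoder here processes only half of the sequence; the full length is restored only at the decoder input, where the mask token fills the dropped positions.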