EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, Sang Hoon Woo
TL;DR
EnCLAP tackles automated audio captioning (AAC) under data-scarce conditions by feeding two pretrained acoustic representations, discrete EnCodec codes and a sequence-level CLAP embedding, into a pretrained BART language model. A new training objective, Masked Codec Modeling (MCM), encourages the BART encoder to capture context among acoustic codes. Experiments on AudioCaps and Clotho show state-of-the-art performance on AudioCaps and strong gains on Clotho; the large model performs best when more training data is available and is sensitive to dataset size. The work suggests that discrete neural codec codes combined with sequence-level acoustic features are an effective input to language models for AAC, and the authors release code and a demo for reproducibility.
Abstract
We propose EnCLAP, a novel framework for automated audio captioning. EnCLAP employs two acoustic representation models, EnCodec and CLAP, along with a pretrained language model, BART. We also introduce a new training objective called masked codec modeling that improves acoustic awareness of the pretrained language model. Experimental results on AudioCaps and Clotho demonstrate that our model surpasses the performance of baseline models. Source code will be available at https://github.com/jaeyeonkim99/EnCLAP . An online demo is available at https://huggingface.co/spaces/enclap-team/enclap .
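To make the idea of masked codec modeling more concrete, the following is a minimal sketch of how spans of discrete codec frames could be masked and paired with prediction targets. It is not the authors' implementation: the function name `mask_codec_spans`, the mask ratio, the span length, the extra `mask_id` token, and the `-100` ignore label are all assumptions chosen for illustration.

```python
import random

IGNORE = -100  # assumed label for positions excluded from the loss


def mask_codec_spans(frames, mask_id, mask_ratio=0.15, span_len=3, seed=0):
    """Mask contiguous spans of codec frames (illustrative only).

    frames: list of frames, each a list of codebook indices (one per codebook).
    Returns (masked_frames, labels): labels hold the original codes at masked
    positions and IGNORE everywhere else.
    """
    n = len(frames)
    if n == 0:
        return [], []
    rng = random.Random(seed)
    target = max(1, int(n * mask_ratio))  # number of frames to mask
    masked = set()
    while len(masked) < target:
        start = rng.randrange(n)
        # Mask a span starting at `start`, stopping once the budget is used.
        for t in range(start, min(start + span_len, n)):
            if len(masked) < target:
                masked.add(t)

    masked_frames, labels = [], []
    for t, frame in enumerate(frames):
        if t in masked:
            masked_frames.append([mask_id] * len(frame))  # hide every codebook
            labels.append(list(frame))                    # model must recover these
        else:
            masked_frames.append(list(frame))
            labels.append([IGNORE] * len(frame))
    return masked_frames, labels


# Toy input: 20 frames with 2 codebooks each; 1024 is a hypothetical mask token id.
frames = [[i, i + 100] for i in range(20)]
masked_frames, labels = mask_codec_spans(frames, mask_id=1024, mask_ratio=0.25)
```

In a setup like this, an auxiliary classifier on the encoder outputs would be trained to predict `labels` at the masked positions, alongside the usual captioning loss; how the two losses are weighted is left open here.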
