M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto
TL;DR
This work addresses the need for a general-purpose audio representation that supports both zero-shot inference and transfer learning. It introduces M2D-CLAP, a multitask framework that combines Masked Modeling Duo (M2D) with CLAP to align audio embeddings with a semantic text space. The resulting representation performs well in linear evaluation, fine-tuning, and zero-shot classification across diverse tasks, including a state-of-the-art zero-shot accuracy of 75.17% on GTZAN, alongside competitive transfer performance. By training on AudioSet with caption data and sharing a semantic embedding space with text, M2D-CLAP provides a versatile representation for both conventional audio tasks and language-aligned multimodal tasks. The authors release code and a new caption dataset to facilitate future research on general-purpose audio-language representations.
Abstract
Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference of audio and exhibits promising performance in several classification tasks. However, conventional audio representations are still crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new representation, a general-purpose audio-language representation, that performs well in both ZS and transfer learning. To do so, we propose a new method, M2D-CLAP, which combines the self-supervised learning method Masked Modeling Duo (M2D) with CLAP. M2D learns an effective representation for modeling audio signals, and CLAP aligns that representation with text embeddings. As a result, M2D-CLAP learns a versatile representation that supports both ZS and transfer learning. Experiments show that M2D-CLAP performs well on linear evaluation, fine-tuning, and ZS classification, including a state-of-the-art ZS accuracy of 75.17% on GTZAN, thus achieving a general-purpose audio-language representation.
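To illustrate what ZS inference with a text-aligned audio representation looks like in practice, the following is a minimal sketch, not the authors' implementation. It assumes the audio and text encoders already map into a shared embedding space and that the class is chosen by cosine similarity between the audio embedding and the embedding of each label's text prompt, which is the usual CLAP-style procedure. The function names, the toy three-dimensional embeddings, and the label set are all hypothetical placeholders; real embeddings would come from the trained encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(audio_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the audio embedding.

    audio_emb: shape (D,), embedding of one audio clip (hypothetical input).
    text_embs: shape (C, D), one embedding per label prompt (hypothetical input).
    labels:    list of C label strings.
    """
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    t = l2_normalize(np.asarray(text_embs, dtype=np.float64))
    sims = t @ a  # cosine similarity to each label, shape (C,)
    return labels[int(np.argmax(sims))], sims


# Toy stand-ins for encoder outputs (assumption: not real model embeddings).
labels = ["blues", "classical", "rock"]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_emb = np.array([0.1, 0.2, 0.9])

predicted, sims = zero_shot_classify(audio_emb, text_embs, labels)
print(predicted)  # prints "rock"
```

No task-specific training is involved in this step: changing the label set only requires embedding new text prompts, which is why ZS inference suits classification but not tasks such as regression, where transfer learning on the audio representation is still needed.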
