MuLan: A Joint Embedding of Music Audio and Natural Language

Qingqing Huang, Aren Jansen, Joonseok Lee, Ravi Ganti, Judith Yue Li, Daniel P. W. Ellis

TL;DR

MuLan links music audio to unconstrained natural language by training a two-tower audio-text embedding with cross-modal contrastive learning on 44M music videos. The paper evaluates two audio encoders (ResNet-50 and AST) paired with a BERT text encoder, and demonstrates zero-shot tagging, cross-modal retrieval, and language understanding in the music domain. The approach achieves state-of-the-art transfer learning results for music tagging and supports flexible querying beyond fixed ontologies, despite noisy text data. The work highlights the value of large-scale, weakly paired data and points to future improvements in text filtering and in handling rare language constructs.
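
To make "cross-modal contrastive learning" concrete, here is a minimal sketch of a generic symmetric softmax contrastive loss over a batch of paired audio and text embeddings. It is an illustration only and not necessarily the exact objective used in the paper; the function names, the temperature value, and the toy vectors are all assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_diagonal_nll(logits):
    """Average negative log-softmax of the diagonal entry of each row."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    The temperature of 0.1 is an assumed value, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between audio i and text j
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (mean_diagonal_nll(logits) + mean_diagonal_nll(transposed))
```

The loss is small when each audio embedding is closest to its own text and large when pairs are mismatched, which is what pulls matched audio and text together in the shared space.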

Abstract

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
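
Zero-shot tagging with a joint embedding can be sketched as a nearest-neighbour search: embed each candidate tag as text, embed the audio clip, and rank tags by cosine similarity. The sketch below assumes the embeddings have already been produced by the two towers; the 3-dimensional vectors, tag names, and function names are hypothetical stand-ins, not outputs of the actual model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def zero_shot_tags(audio_emb, tag_embs, top_k=2):
    """Return the top_k tags whose text embeddings are closest to the audio."""
    scored = sorted(
        ((cosine(audio_emb, emb), tag) for tag, emb in tag_embs.items()),
        reverse=True,
    )
    return [tag for _, tag in scored[:top_k]]


# Toy embeddings standing in for text-tower outputs (hypothetical values).
tag_embs = {
    "jazz": [0.9, 0.1, 0.0],
    "heavy metal": [0.0, 0.2, 0.9],
    "relaxing piano": [0.7, 0.6, 0.1],
}
# Toy embedding standing in for an audio-tower output (hypothetical values).
clip = [0.8, 0.3, 0.05]
top_tags = zero_shot_tags(clip, tag_embs)
```

Because tags are just text, the candidate set can be changed at query time without retraining, which is how a joint embedding can cover existing ontologies and free-form descriptions with the same model.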

Paper Structure (18 sections, 1 equation, 2 figures, 6 tables)