The Interpretation Gap in Text-to-Music Generation Models

Yongyi Zang, Yixiao Zhang

TL;DR

The paper identifies an interpretation gap in text-to-music generation where models struggle to map musician controls to outputs. It proposes a three-stage framework (expression, interpretation, execution) and argues that interpretation is the bottleneck hindering human-AI collaboration in music. To address this, it suggests two avenues: directly learning from diverse human-interpretation data and leveraging strong priors from large language models for musical interpretation, including pseudo-description techniques. The authors call on the music information retrieval community to focus research on interpretation to enable practical and creative human-AI collaboration.

Abstract

Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.

Paper Structure (8 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The comparison between human-human and human-AI interaction processes. We observe that the gap exists at both the interpretation stage and the execution stage, while the interpretation stage is often overlooked by current research.
  • Figure 2: The proposed model that describes the musical interaction process.