Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation

Or Tal; Alon Ziv; Itai Gat; Felix Kreuk; Yossi Adi

TL;DR

JASCO generates high-quality music conditioned on a global text description together with fine-grained local controls. Its generation quality is comparable to the evaluated baselines, while it offers significantly better and more versatile control over the generated music.

Abstract

We present JASCO, a temporally controlled text-to-music generation model utilizing both symbolic and audio-based conditions. JASCO can generate high-quality music samples conditioned on global text descriptions along with fine-grained local controls. JASCO is based on the Flow Matching modeling paradigm together with a novel conditioning method. This allows music generation controlled both locally (e.g., chords) and globally (text description). Specifically, we apply information bottleneck layers in conjunction with temporal blurring to extract relevant information with respect to specific controls. This allows the incorporation of both symbolic and audio-based conditions in the same text-to-music model. We experiment with various symbolic control signals (e.g., chords, melody), as well as with audio representations (e.g., separated drum tracks, full-mix). We evaluate JASCO considering both generation quality and condition adherence, using both objective metrics and human studies. Results suggest that JASCO is comparable to the evaluated baselines considering generation quality while allowing significantly better and more versatile controls over the generated music. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/JASCO.

Paper Structure (13 sections, 4 equations, 2 figures, 5 tables)

Introduction
Background
Method
Temporal Controls
Model and Optimization
Inference
Experimental Setup
Evaluation Metrics
Results
Analysis
Related Work
Discussion
Ethical statement

Figures (2)

Figure 1: The top panel shows the temporal blurring process: source separation, pooling, and broadcasting. The bottom panel gives a high-level overview of JASCO. Conditions are first projected to a low-dimensional representation and then concatenated along the channel dimension. Green blocks have learnable parameters, while blue blocks are frozen.
Figure 2: Comparison of v-Diffusion vs Flow Matching. We report FAD, KL, and CLAP on the internal dataset.
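
The pooling, broadcasting, projection, and channel-wise concatenation named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: the function names, array shapes, window size, and the random projection matrices (stand-ins for the learnable projections) are all hypothetical, and the source separation step is omitted.

```python
import numpy as np


def temporal_blur(features: np.ndarray, window: int) -> np.ndarray:
    """Blur a (time, channels) array along time.

    Each non-overlapping window of frames is replaced by its mean
    (pooling), and that mean is written back to every frame in the
    window (broadcasting), so the output keeps the input frame rate.
    """
    num_frames = features.shape[0]
    blurred = np.empty_like(features, dtype=float)
    for start in range(0, num_frames, window):
        stop = min(start + window, num_frames)  # last window may be shorter
        blurred[start:stop] = features[start:stop].mean(axis=0, keepdims=True)
    return blurred


def project_and_concat(conditions, projections):
    """Project each (time, channels) condition to a lower dimension and
    concatenate the results along the channel axis."""
    projected = [cond @ weight for cond, weight in zip(conditions, projections)]
    return np.concatenate(projected, axis=-1)


rng = np.random.default_rng(0)
drums = rng.normal(size=(10, 8))    # hypothetical audio-condition features
chords = rng.normal(size=(10, 12))  # hypothetical symbolic-condition features

blurred_drums = temporal_blur(drums, window=4)

# Random matrices stand in for learnable projection layers (assumption).
w_drums = rng.normal(size=(8, 2))
w_chords = rng.normal(size=(12, 3))
joint = project_and_concat([blurred_drums, chords], [w_drums, w_chords])
```

In this toy setup, every frame within a window carries the same vector after blurring, which removes fine temporal detail from the audio condition while keeping its coarse trajectory. `joint` holds one low-dimensional conditioning vector per frame.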
