Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

Shahan Nercessian, Johannes Imort, Ninon Devis, Frederik Blang

TL;DR

This work addresses generating playable sample-based musical instruments from text or audio prompts using neural audio codec language models. It combines a neural audio codec with a transformer LM, conditioned on pitch, velocity, instrument family, source type, and a CLAP-based cross-modal embedding, and investigates three CLAP conditioning schemes. The authors introduce objective metrics for timbral consistency and prompt alignment, and report results on NSynth data that show a trade-off between expressivity and timbral stability, complemented by subjective MOS/MUSHRA evaluations. The approach broadens text-to-instrument capabilities and enables playable instrument generation from prompts, with implications for flexible sound design and future scalable models.

Abstract

In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.

Paper Structure (20 sections, 11 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Overview of our proposed system. Dotted lines represent frozen pretrained modules. Dashed lines denote steps exclusive to training. CLAP's audio or text head can be used at inference, disregarding source type and instrument family. Training operates on individual samples $\mathbf{x}$, while inference creates a set of samples $\hat{\mathbf{X}}$ from a consistent CLAP prompt and varied pitch/velocity cues to create a full instrument. Different piano keys/colors denote different pitches/velocities.
  • Figure 2: Covariance matrices for the text prompt $\mathbf{t}_k=$ "aggressive synth lead", computed using (a) naive replication, (b) translation, (c) coloration (matching the ground truth covariance $\mathbf{A}_{11,*}$ learned over the 53 instruments reflected in the NSynth validation/test sets), and (d) cosine similarities relative to estimated $\hat{\rho}_k$/$\hat{\nu}_k$, corresponding to note E5/velocity 100.
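
The inference step described in the Figure 1 caption (one consistent CLAP prompt paired with varied pitch/velocity cues) can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `build_conditioning_grid`, the 512-dimensional embedding size, the random stand-in embedding, and the five velocity layers (the values used in NSynth) are assumptions made for illustration. The sketch only assembles the conditioning inputs for the naive replication case, where every note receives the same embedding, and does not model the codec or the language model.

```python
import numpy as np

# 88-key piano range as MIDI note numbers (A0 = 21 ... C8 = 108).
PITCHES = np.arange(21, 109)
# Assumed velocity layers; these five values mirror the ones used in NSynth.
VELOCITIES = np.array([25, 50, 75, 100, 127])


def build_conditioning_grid(prompt_embedding, pitches=PITCHES, velocities=VELOCITIES):
    """Pair one fixed prompt embedding with every (pitch, velocity) cue.

    Hypothetical helper: returns flat arrays with one entry per sample
    that would be generated for the instrument.
    """
    prompt_embedding = np.asarray(prompt_embedding, dtype=np.float32)
    # Cartesian product of pitch and velocity cues.
    pitch_grid, velocity_grid = np.meshgrid(pitches, velocities, indexing="ij")
    pitch_cues = pitch_grid.ravel()
    velocity_cues = velocity_grid.ravel()
    # Naive replication: every note is conditioned on the same embedding.
    embeddings = np.tile(prompt_embedding, (pitch_cues.size, 1))
    return pitch_cues, velocity_cues, embeddings


# Stand-in for a CLAP text or audio embedding (random, unit-normalised).
rng = np.random.default_rng(0)
embedding = rng.standard_normal(512).astype(np.float32)
embedding /= np.linalg.norm(embedding)

pitch_cues, velocity_cues, embeddings = build_conditioning_grid(embedding)
print(pitch_cues.shape, velocity_cues.shape, embeddings.shape)
```

Each row of `embeddings`, together with the matching entries of `pitch_cues` and `velocity_cues`, would condition the generation of one sample in the set $\hat{\mathbf{X}}$. The translation and coloration schemes named in the Figure 2 caption would replace the `np.tile` step with a per-note embedding; their details are not given in this section, so they are not sketched here.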