Learning Relationships Between Separate Audio Tracks for Creative Applications

Balthazar Bujard, Jérôme Nika, Frédéric Bevilacqua, Nicolas Obin

TL;DR

Problem: learn and tune the musical relationship between a live input and a real-time generated output in Musical Agents, using a corpus of separated tracks.

Approach: an offline architecture with a perception module (Wav2Vec 2.0-based), a Transformer-based decision module that predicts symbolic specifications from the input, and an action module using Dicy2 with corpus-based concatenative synthesis; evaluation uses a re-generation task on MoisesDB and MICA.

Key contributions: formalizes a three-module framework for relationship-driven Musical Agents, demonstrates offline learning of corpus-level musical relationships, and introduces a quantitative re-generation evaluation with true positive percentage and longest common prefix metrics; discusses the effects of vocabulary size, segmentation, and constrained generation.

Significance: enables automatic learning of general musical relationships across corpora, improving the customization and adaptability of Musical Agents for composition and improvisation; highlights practical considerations such as dataset size and perceptual representation in creative AI.

Abstract

This paper presents the first step in a research project situated within the field of musical agents. The objective is to tune, through training, the desired musical relationship between a live musical input and a real-time generated musical output, by curating a database of separated tracks. We propose an architecture integrating a symbolic decision module capable of learning and exploiting musical relationships from such a musical corpus. We detail an offline implementation of this architecture that employs Transformers as the decision module, associated with a perception module based on Wav2Vec 2.0 and with concatenative synthesis as the audio renderer. We present a quantitative evaluation of the decision module's ability to reproduce the relationships learned during training. We demonstrate that our decision module can predict a coherent track B when conditioned on its corresponding "guide" track A, based on a corpus of paired tracks (A, B).
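The re-generation evaluation compares a predicted symbolic sequence for track B with the reference sequence. The summary above names two metrics, true positive percentage and longest common prefix, without giving formulas, so the following is only a minimal sketch under assumed definitions: TPP is taken as the percentage of aligned positions where the predicted token equals the reference token, and LCP as the number of leading tokens that match. The function names and the toy sequences are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def true_positive_percentage(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Assumed definition: percentage of aligned positions with identical tokens."""
    n = min(len(predicted), len(target))
    if n == 0:
        return 0.0
    matches = sum(1 for p, t in zip(predicted, target) if p == t)
    return 100.0 * matches / n


def longest_common_prefix(predicted: Sequence[int], target: Sequence[int]) -> int:
    """Assumed definition: number of leading positions where both sequences agree."""
    length = 0
    for p, t in zip(predicted, target):
        if p != t:
            break
        length += 1
    return length


# Toy token sequences (illustrative only, not data from the paper).
target = [3, 3, 7, 1, 4, 4]
predicted = [3, 3, 7, 2, 4, 0]

tpp = true_positive_percentage(predicted, target)  # 4 of 6 positions match
lcp = longest_common_prefix(predicted, target)     # first mismatch at index 3
```

Under these assumptions, TPP rewards agreement anywhere in the sequence, while LCP only rewards agreement before the first error, which makes it more sensitive to early mistakes in autoregressive generation.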

Paper Structure

This paper contains 26 sections, 6 equations, 8 figures.

Figures (8)

  • Figure 1: Online vs. offline framework for relationship-based Musical Agents.
  • Figure 2: Overview of the proposed relationship-based Musical Agent architecture. The pipeline is divided into three modules. The Perception module encodes an audio input $\vec{\textbf{x}}$ into a sub-sampled, quantized symbolic sequence $\vec{\textbf{x}}_q$ using a pre-trained encoder, a temporal condenser, and a vector quantizer. The Decision module generates a symbolic response $\vec{\textbf{y}}_q$ based on $\vec{\textbf{x}}_q$ through a sequence decoder and token selection. Finally, the Action module reconstructs audio from $\vec{\textbf{y}}_q$ via corpus-based concatenative synthesis. The synthesis corpus is a user-defined audio file segmented and encoded through the same Perception module.
  • Figure 3: Training procedure for learning symbolic musical relationships.
  • Figure 4: True Positive Percentage (TPP) of the re-generation task for MoisesDB (a) and MICA (b). Model configurations are grouped by alphabet size and ordered by segmentation duration. The random baseline (gray) is plotted to the right of its corresponding configuration. A Mann-Whitney test returns $p < 0.0001$ for all configurations.
  • Figure 5: True Positive Percentage (TPP) of the re-generation task for MoisesDB (a) and MICA (b) under the constrained generation setup. Model configurations are grouped by alphabet size and ordered by segmentation duration. The random baseline (gray) is plotted to the right of its corresponding configuration. A Mann-Whitney test returns $p < 0.0001$ for all configurations.
  • ...and 3 more figures
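
The three-module pipeline described in the Figure 2 caption (perception, decision, action) can be illustrated with a toy sketch. Everything here is a simplifying assumption rather than the authors' implementation: two-dimensional vectors stand in for the pre-trained encoder's condensed features, a nearest-neighbour codebook lookup stands in for the vector quantizer, a fixed lookup table replaces the Transformer sequence decoder, and the first matching corpus segment replaces Dicy2's concatenative selection. All names (`quantize`, `decide`, `act`, `codebook`, `mapping`) are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Perception (toy): map each feature frame to its nearest codebook index."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in frames]


def decide(x_q: List[int], mapping: Dict[int, int]) -> List[int]:
    """Decision (toy): a lookup table standing in for the learned decoder."""
    return [mapping.get(tok, tok) for tok in x_q]


def act(y_q: List[int], corpus_labels: List[int]) -> List[int]:
    """Action (toy): pick the first corpus segment carrying each requested
    label, or -1 when no segment in the corpus has that label."""
    first_index: Dict[int, int] = {}
    for i, label in enumerate(corpus_labels):
        first_index.setdefault(label, i)
    return [first_index.get(tok, -1) for tok in y_q]


# Shared codebook: the corpus is encoded by the same perception step as the input.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
input_frames = [(0.1, 0.0), (0.9, 1.1), (0.1, 0.8)]
corpus_frames = [(1.0, 0.9), (0.0, 0.1), (0.1, 1.0), (0.9, 1.0)]

x_q = quantize(input_frames, codebook)        # symbolic input sequence
y_q = decide(x_q, {0: 2, 1: 0, 2: 1})         # symbolic response
corpus_labels = quantize(corpus_frames, codebook)
segment_ids = act(y_q, corpus_labels)         # corpus segments to concatenate
```

The point of the sketch is the data flow: the decision step only sees and emits symbols, so the synthesis corpus can be swapped without retraining as long as it is encoded with the same perception step.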