Learning Relationships Between Separate Audio Tracks for Creative Applications
Balthazar Bujard, Jérôme Nika, Frédéric Bevilacqua, Nicolas Obin
TL;DR
Problem: learn and tune the musical relationship between a live input and a real-time generated output in musical agents, using a corpus of separated tracks. Approach: an offline architecture with a perception module (based on Wav2Vec 2.0), a Transformer-based decision module that predicts symbolic specifications from the input, and an action module that uses Dicy2 with corpus-based concatenative synthesis; evaluation uses a re-generation task on MoisesDB and MICA. Key contributions: formalizes a three-module framework for relationship-driven musical agents; demonstrates offline learning of corpus-level musical relationships; introduces a quantitative re-generation evaluation with true positive percentage and longest common prefix metrics; and discusses the effects of vocabulary size, segmentation, and constrained generation. Significance: enables automatic learning of general musical relationships across corpora, improving the customization and adaptability of musical agents for composition and improvisation, and highlights practical considerations such as dataset size and perceptual representation in creative AI.
Abstract
This paper presents the first step in a research project situated within the field of musical agents. The objective is to tune, through training, the desired musical relationship between a live musical input and a real-time generated musical output by curating a database of separated tracks. We propose an architecture integrating a symbolic decision module capable of learning and exploiting musical relationships from such a musical corpus. We detail an offline implementation of this architecture that employs Transformers as the decision module, together with a perception module based on Wav2Vec 2.0 and concatenative synthesis as the audio renderer. We present a quantitative evaluation of the decision module's ability to reproduce the relationships learned during training. We demonstrate that our decision module can predict a coherent track B when conditioned on its corresponding "guide" track A, based on a corpus of paired tracks (A, B).
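The two metrics named in the TL;DR can be illustrated with a minimal Python sketch. It assumes that both the predicted and the reference track B are represented as equal-length sequences of integer symbol labels, that the true positive percentage is the share of aligned positions where the labels agree, and that the longest common prefix is the number of leading positions that match. These definitions, the function names, and the toy sequences are assumptions made for illustration; the paper's exact formulation may differ.

```python
from typing import Sequence


def true_positive_percentage(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Percentage of aligned positions where predicted and target labels agree.

    Assumed definition for illustration; not taken from the paper.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if len(target) == 0:
        return 0.0
    matches = sum(1 for p, t in zip(predicted, target) if p == t)
    return 100.0 * matches / len(target)


def longest_common_prefix(predicted: Sequence[int], target: Sequence[int]) -> int:
    """Number of leading positions where both sequences carry the same label."""
    length = 0
    for p, t in zip(predicted, target):
        if p != t:
            break
        length += 1
    return length


# Hypothetical symbol sequences: the reference track B and a re-generated one.
target_labels = [3, 3, 7, 1, 4, 4]
predicted_labels = [3, 3, 7, 2, 4, 0]

tp = true_positive_percentage(predicted_labels, target_labels)
lcp = longest_common_prefix(predicted_labels, target_labels)
print(f"true positives: {tp:.1f}%, longest common prefix: {lcp}")
```

In this toy case four of the six positions match and the first three labels agree before the first error, so the two metrics capture different things: overall agreement versus how long the prediction stays on track from the start.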
