Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space

Christian Limberg, Fares Schulz, Zhe Zhang, Stefan Weinzierl

TL;DR

This work presents pGESAM, a two-stage semi-supervised framework that disentangles pitch and timbre within a 2D latent space and uses a Transformer to synthesize pitch-accurate instrument sounds conditioned on timbre. Stage 1 employs a pitch-aware VAE with a 2D latent representation and specialized losses to achieve clean pitch-timbre separation, while Stage 2 uses a Transformer to generate audio embeddings from the timbre latent and pitch input. Through NSynth experiments and an ablation study, the method demonstrates strong pitch control, coherent timbre navigation, and robust latent-space structure, and is complemented by an interactive web app for real-time exploration. The approach offers a practical, intuitive pathway for interactive music production that blends expressive timbre control with reliable pitch precision.

Abstract

This paper presents a novel approach to neural instrument sound synthesis using a two-stage semi-supervised learning framework capable of generating pitch-accurate, high-quality music samples from an expressive timbre latent space. Existing approaches that achieve sufficient quality for music production often rely on high-dimensional latent representations that are difficult to navigate and provide unintuitive user experiences. We address this limitation through a two-stage training paradigm: first, we train a pitch-timbre disentangled 2D representation of audio samples using a Variational Autoencoder; second, we use this representation as conditioning input for a Transformer-based generative model. The learned 2D latent space serves as an intuitive interface for navigating and exploring the sound landscape. We demonstrate that the proposed method effectively learns a disentangled timbre space, enabling expressive and controllable audio generation with reliable pitch conditioning. Experimental results show the model's ability to capture subtle variations in timbre while maintaining a high degree of pitch accuracy. The usability of our method is demonstrated in an interactive web application (https://pgesam.faresschulz.com), highlighting its potential as a step towards future music production environments that are both intuitive and creatively empowering.

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Main training paradigm of our approach.
  • Figure 2: Visualization of the latent space with different model configurations.
  • Figure 3: Visualization of the latent space of the proposed VAE model. The plot shows the predicted timbre latent mean vectors $\tilde{\mu}$ for all samples in the training set. Each point represents a sample. Instruments of the same family share a base color; the offsets in the color spectrum indicate different instrument IDs.
  • Figure 4: Visualization of the latent space with different model configurations. Each subfigure shows the latent space of a VAE trained without one of the proposed components. Each point represents a sample. Instruments of the same family share a base color; the offsets in the color spectrum indicate different instrument IDs.