Designing Neural Synthesizers for Low-Latency Interaction

Franco Caspe, Jordie Shier, Mark Sandler, Charalampos Saitis, Andrew McPherson

TL;DR

This work centers latency as a fundamental design constraint for neural audio synthesis in musical interaction. It analyzes latency sources in existing NAS models, with a detailed case study of RAVE, and then iteratively redesigns the architecture to achieve low-latency, low-jitter real-time inference. The result is BRAVE, a causal, low-latency variational autoencoder that supports timbre transfer with preserved content and competitive audio quality, demonstrated via a proof-of-concept plugin and an open-source evaluation toolkit. The findings underscore the importance of receptive field and temporal trajectories in latent representations for enabling rich, low-latency musical interaction, and offer practical guidelines for building future real-time NAS systems. The work thus advances interactive DSP-style NAS capabilities while providing actionable benchmarks and tooling for researchers and designers.

Abstract

Neural Audio Synthesis (NAS) models offer interactive musical control over high-quality, expressive audio generators. While these models can operate in real-time, they often suffer from high latency, making them unsuitable for intimate musical interaction. The impact of architectural choices in deep learning models on audio latency remains largely unexplored in the NAS literature. In this work, we investigate the sources of latency and jitter typically found in interactive NAS models. We then apply this analysis to the task of timbre transfer using RAVE, a convolutional variational autoencoder for audio waveforms introduced by Caillon et al. in 2021. Finally, we present an iterative design approach for optimizing latency. This culminates with a model we call BRAVE (Bravely Realtime Audio Variational autoEncoder), which is low-latency and exhibits better pitch and loudness replication while showing timbre modification capabilities similar to RAVE. We implement it in a specialized inference framework for low-latency, real-time inference and present a proof-of-concept audio plugin compatible with audio signals from musical instruments. We expect the challenges and guidelines described in this document to support NAS researchers in designing models for low-latency inference from the ground up, enriching the landscape of possibilities for musicians.
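One latency source that follows directly from architecture is input buffering: an encoder that compresses time by a fixed ratio has to wait for that many input samples before it can emit a latent frame. The sketch below shows only this arithmetic. The function name, the 44.1 kHz sample rate, and the two compression ratios are illustrative assumptions, not figures reported in this summary.

```python
def buffering_delay_ms(compression_ratio: int, sample_rate: int) -> float:
    """Time needed to collect one encoder frame of input samples, in ms."""
    return 1000.0 * compression_ratio / sample_rate


# Hypothetical compression ratios, chosen only to show the scale of the effect.
for ratio in (2048, 128):
    print(ratio, round(buffering_delay_ms(ratio, 44100), 2))
```

Buffering is only one term in the total: representation delay and cumulative delay from non-causal layers add to it, so this value is a lower bound on end-to-end latency.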

Paper Structure

This paper contains 28 sections, 2 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: Simplified architectural comparison. BRAVE achieves adequate latency ($<10$ ms) and jitter (3 ms) by removing RAVE's noise generator and using a smaller encoder compression ratio, PQMF attenuation, and causal training, reducing its buffering, representation, and cumulative delays, respectively. The number of parameters is also reduced to improve its RTF (see Table \ref{tab:models}). Numbers in monospace denote the compression ratio of intermediate results.
  • Figure 2: We compute the MMD between the timbre transfer results of each model and the input (top row) and target instrument (bottom row) datasets. Along the horizontal axis, entries in bold show dataset cross-similarity and self-similarity; these are followed by the MMD results for timbre transfer of all model variants. Lower values for the target instrument indicate closer alignment to the target distribution and therefore a more successful timbre transfer.
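
As a rough illustration of the metric in Figure 2, the sketch below computes a biased estimate of squared MMD (maximum mean discrepancy) between two sets of feature vectors. The Gaussian (RBF) kernel, the bandwidth of 1.0, and the randomly generated "features" are all assumptions made for this example; this summary does not state which kernel or audio features the authors use.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def sample_features(rng, center, n=50, dim=4):
    """Hypothetical stand-in for audio features: Gaussian vectors around center."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
target = sample_features(rng, 0.0)        # stands in for the target instrument set
close_output = sample_features(rng, 0.2)  # output distributed near the target
far_output = sample_features(rng, 2.0)    # output distributed far from the target

# The closer output should score the lower MMD against the target.
print(mmd_squared(close_output, target) < mmd_squared(far_output, target))
```

Read this way, a model whose transferred audio yields a lower value against the target instrument set sits closer to that distribution, which is how the bottom row of Figure 2 is interpreted.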