Live Music Models

Lyria Team, Antoine Caillon, Brian McWilliams, Cassie Tarakajian, Ian Simon, Ilaria Manco, Jesse Engel, Noah Constant, Yunpeng Li, Timo I. Denk, Alberto Lalama, Andrea Agostinelli, Cheng-Zhi Anna Huang, Ethan Manilow, George Brower, Hakan Erdogan, Heidi Lei, Itai Rolnick, Ivan Grishchenko, Manu Orsini, Matej Kastelic, Mauricio Zuluaga, Mauro Verzetti, Michael Dooley, Ondrej Skopek, Rafael Ferrer, Savvas Petridis, Zalán Borsos, Aäron van den Oord, Douglas Eck, Eli Collins, Jason Baldridge, Tom Hume, Chris Donahue, Kehang Han, Adam Roberts

TL;DR

Live music models address the challenge of real-time, interactive AI-assisted performance by enabling continuous streaming with synchronized user control. The authors present Magenta RealTime (open-weights, on-device) and Lyria RealTime (API-based), both built on a codec language modeling framework that combines SpectroStream tokenization with MusicCoCa style embeddings and uses a chunk-based encoder-decoder Transformer to sustain real-time throughput. Key contributions include a unified live LM architecture with chunked autoregression, audio-text style conditioning, and audio-injection mechanisms, validated through objective metrics and a user study, along with publicly available open-weights and API-based systems. The work demonstrates a practical pathway to on-device and cloud-based AI-assisted live music creation, with future directions targeting ultra-low latency, multi-stem collaboration, and richer control interfaces for performers.

Abstract

We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.

Paper Structure

This paper contains 47 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Magenta RealTime is a live music model that generates an uninterrupted stream of music and responds continuously to user input. It generates audio in two-second chunks using a pipeline with three components: (1) MusicCoCa, a style embedding model, (2) SpectroStream (Li et al., 2025), an audio codec model, and (3) an encoder-decoder language model. For each chunk, a style embedding is computed via a weighted average of MusicCoCa embeddings of text and audio prompts from the user. Given this style embedding and $10$ seconds ($5$ chunks) of past audio context, the language model decoder generates SpectroStream audio tokens for the new chunk, which is then decoded to audio.
  • Figure 2: Prompt transition evaluation. Over $60$s, we transition from embeddings of text prompt A to B by stepwise linear interpolation. Left: Cosine similarity compared to the initial (blue) and final (red) text embedding. Right: Cosine similarity to the interpolation between text embeddings provided to the model. In both plots, lines indicate the mean and shaded regions the standard deviation.
  • Figure 3: Overall architecture of Magenta RT. Coarse acoustic tokens and quantized style tokens corresponding to 10s of audio context are concatenated and fed to the encoder part of our model. The decoder then predicts coarse and medium acoustic tokens corresponding to the following 2 seconds.
  • Figure 4: Diagram of Lyria RT training / predicting conditioning tokens. Coarse acoustic tokens and quantized MuLan tokens are concatenated and fed to the encoder part of our language model. Control tokens, including BPM, stem balance, brightness, density and chromas (see Section \ref{sec:descriptor_conditioning}) are predicted first by the decoder, followed by coarse and medium level acoustic tokens. Finally, a small refinement model predicts the fine-scale acoustic tokens as described in Section \ref{sec:refinement}.
  • Figure 5: Control priors for self-conditioning. The predicted logits (likelihood) for the control tokens are shifted by a user-defined prior dictated by the control values. These are combined to give the final posterior logits, which are used for sampling and steer the model outputs in the direction of the user controls.
  • ...and 3 more figures
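
The mechanisms named in the captions for Figures 1 and 5 can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. All function names are hypothetical, the "model" is a caller-supplied stand-in, embeddings are plain lists of floats, and the additive combination of logits and prior is an assumption based on the wording "shifted by" in the Figure 5 caption.

```python
import math
from collections import deque


def blend_style_embeddings(embeddings, weights):
    """Weighted average of prompt embeddings; weights are normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum(w * e[i] for e, w in zip(embeddings, weights)) / total
        for i in range(dim)
    ]


def stream_chunks(generate_chunk, style_for_step, num_steps, context_chunks=5):
    """Chunked autoregression with a rolling context window.

    `generate_chunk(context, style)` is a hypothetical stand-in for the
    language model; it receives at most `context_chunks` previous chunks
    (5 two-second chunks = 10 seconds, per the Figure 1 caption).
    """
    context = deque(maxlen=context_chunks)  # oldest chunk is dropped automatically
    for step in range(num_steps):
        chunk = generate_chunk(list(context), style_for_step(step))
        context.append(chunk)
        yield chunk


def apply_control_prior(logits, prior_logits):
    """Shift predicted logits by a user prior, then softmax to probabilities.

    Assumes the shift is additive in logit space (our reading of Figure 5).
    """
    shifted = [l + p for l, p in zip(logits, prior_logits)]
    peak = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    norm = sum(exps)
    return [e / norm for e in exps]
```

A stepwise linear transition between two prompts, as in the Figure 2 evaluation, would correspond to calling `blend_style_embeddings([emb_a, emb_b], [1 - t, t])` inside `style_for_step`, with `t` increasing from 0 to 1 over the transition; `emb_a` and `emb_b` are placeholders for the two prompt embeddings.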