The challenge of realistic music generation: modelling raw audio at scale
Sander Dieleman, Aäron van den Oord, Karen Simonyan
TL;DR
This work tackles realistic music generation by modelling raw audio with autoregressive discrete autoencoders (ADAs), which capture long-range structure across multiple timescales. It considers two quantisation approaches, the existing VQ-VAE and the newly introduced argmax autoencoder (AMAE), and demonstrates hierarchical, multi-level architectures that extend effective receptive fields to tens of seconds of piano audio. The study finds that ADAs can produce samples that remain stylistically consistent over tens of seconds, at some cost in audio fidelity, and that AMAE trains more robustly because it avoids codebook collapse. The results highlight the importance of multi-scale conditioning for raw-audio music generation and point to future work on high-level conditioning and multi-instrument extensions.
Abstract
Realistic music generation is a challenging task. Generative models of music that are learnt from data typically use high-level representations such as scores or MIDI, which abstract away the idiosyncrasies of a particular performance. But these nuances are very important for our perception of musicality and realism, so in this work we embark on modelling music in the raw audio domain. It has been shown that autoregressive models excel at generating raw audio waveforms of speech, but when applied to music, we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find that they allow us to unconditionally generate piano music directly in the raw audio domain, and that the generated music shows stylistic consistency across tens of seconds.
