Linear RNNs for autoregressive generation of long music samples
Konrad Szewczyk, Daniel Gallo Fernández, James Townsend
TL;DR
The paper tackles autoregressive generation of long raw audio by introducing HarmonicRNN, a deep linear RNN built from a CG-LRU core and embedded in a temporal-block framework with multi-scale down/up pooling. This design enables context-parallel training on very long sequences (up to 1M tokens, about one minute of audio) and achieves state-of-the-art log-likelihoods and perceptual metrics on small-scale audio datasets, while using about 7.3M parameters. The key findings are that sinusoidal input embeddings and carefully tuned pooling (four groups) are crucial for stability and coherence over long timescales. The work demonstrates the viability of scalable, efficient linear RNNs for long-audio autoregressive generation and sets the stage for scaling to more complex audio tasks.
Abstract
Directly learning to generate audio waveforms autoregressively is challenging because raw sequences are long and contain important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have had only limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.
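As a rough illustration of why linear recurrences suit context-parallel training, the sketch below evaluates a scalar linear recurrence h_t = a_t * h_{t-1} + b_t in two ways: step by step, and split into chunks whose summaries are composed before each chunk is scanned locally. This is a minimal sketch and not the authors' CG-LRU or their context-parallel implementation. The function names (`sequential_scan`, `combine`, `chunked_scan`) and the scalar state are assumptions made for illustration only.

```python
from functools import reduce


def sequential_scan(a, b, h0=0.0):
    """Compute h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first.

    This operation is associative, which is what allows a linear
    recurrence to be split across chunks (or devices).
    """
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def chunked_scan(a, b, num_chunks, h0=0.0):
    """Evaluate the same recurrence chunk by chunk (assumes len(a) >= 1).

    Hypothetical illustration: in a real context-parallel setup, steps 1
    and 3 would run on different devices; here they run in a plain loop.
    """
    n = len(a)
    size = -(-n // num_chunks)  # ceiling division
    chunks = [(a[i:i + size], b[i:i + size]) for i in range(0, n, size)]

    # Step 1 (independent per chunk): summarize each chunk as one affine map.
    summaries = [reduce(combine, zip(ca, cb)) for ca, cb in chunks]

    # Step 2 (short sequential pass): state entering each chunk.
    starts, h = [], h0
    for s_a, s_b in summaries:
        starts.append(h)
        h = s_a * h + s_b

    # Step 3 (independent per chunk): local scan from the entering state.
    out = []
    for (ca, cb), start in zip(chunks, starts):
        out.extend(sequential_scan(ca, cb, start))
    return out
```

Only step 2 depends on the number of chunks rather than the sequence length, so under these assumptions the per-chunk work could be distributed while the cross-chunk communication stays small.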
