C-RNN-GAN: Continuous recurrent neural networks with adversarial training

Olof Mogren

TL;DR

The paper introduces C-RNN-GAN, a continuous-sequence GAN with a generator that outputs real-valued tone events and a bidirectional LSTM discriminator, trained to model the joint distribution of musical sequences. It demonstrates that adversarial training increases variability and tonal spread in generated classical music, and that allowing multiple tones per step enhances polyphony (notably in the 3-tone variant). While generated samples move closer to real music on several statistics compared to a maximum-likelihood baseline, they do not yet match human judgments of realism. The work provides a foundation for applying adversarial training to continuous sequential data and highlights the importance of stabilization techniques and multi-tone outputs in improving musicality.

Abstract

Generative adversarial networks have been proposed as a way of efficiently training deep generative neural networks. We propose a generative adversarial model that works on continuous sequential data, and apply it by training it on a collection of classical music. We conclude that it generates music that sounds better and better as the model is trained, report statistics on generated music, and let the reader judge the quality by downloading the generated songs.

Paper Structure

This paper contains 8 sections, 2 equations, 3 figures.

Figures (3)

  • Figure 1: C-RNN-GAN. The generator ($G$) produces sequences of continuous data events. The discriminator ($D$) is trained to distinguish between real music data and generated data.
  • Figure 2: Music generated with C-RNN-GAN with feature matching and three tones per cell.
  • Figure 3: Statistics of generated music from the evaluated models. C-RNN-GAN generates music with increasing complexity as training proceeds. The number of unique tones used has a vaguely increasing trend, while the scale consistency seems to stabilize after ten or fifteen epochs. The 3-tone repetition count has an increasing trend over the first 25 epochs, and then stays at quite a low level, seemingly correlated with the number of tones used. The baseline model does not reach the same level of variation. Its number of unique tones is consistently much lower, while its scale consistency seems to be similar to that of C-RNN-GAN. Its tone span follows the number of unique tones more closely than with C-RNN-GAN, suggesting that the baseline has less variability in the tones used. C-RNN-GAN-3 obtains a higher polyphony score than both C-RNN-GAN and the baseline. After reaching a state with many zero-valued outputs around epochs 50 to 55, C-RNN-GAN-3 reaches substantially higher values on tone span, number of unique tones, intensity span, and 3-tone repetitions. Real music has an intensity span similar to that of the generated music. Its scale consistency is slightly higher, but also varies more. Its polyphony score is similar to that of C-RNN-GAN-3. Its 3-tone repetitions are much higher, but are difficult to compare as the songs are of different lengths. The count is normalized by dividing by $l_r/l_g$, where $l_r$ is the length of the real music, and $l_g$ is the length of the generated music.
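
Some of the statistics named in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the paper's evaluation code. It assumes tones are given as integer MIDI pitch numbers, and it assumes a 3-tone repetition means a window of three consecutive tones that has already occurred earlier in the sequence; the function names are hypothetical. Only the normalization by $l_r/l_g$ is taken directly from the caption.

```python
from collections import Counter
from typing import Sequence


def tone_span(tones: Sequence[int]) -> int:
    """Semitone distance between the highest and lowest tone (0 if empty)."""
    return max(tones) - min(tones) if tones else 0


def unique_tones(tones: Sequence[int]) -> int:
    """Number of distinct tones used in the sequence."""
    return len(set(tones))


def three_tone_repetitions(tones: Sequence[int]) -> int:
    """Count 3-tone windows that repeat an earlier window.

    Assumption: only tone order matters, not timing or intensity.
    """
    windows = Counter(tuple(tones[i:i + 3]) for i in range(len(tones) - 2))
    # Each window seen c times contributes c - 1 repetitions.
    return sum(count - 1 for count in windows.values())


def normalized_count(count: float, l_r: float, l_g: float) -> float:
    """Divide a count measured on real music by l_r / l_g.

    l_r is the length of the real music and l_g the length of the
    generated music, as in the Figure 3 caption.
    """
    return count / (l_r / l_g)


# Hypothetical example sequence (MIDI pitches), not data from the paper.
example = [60, 62, 64, 60, 62, 64, 67]
span = tone_span(example)
distinct = unique_tones(example)
repeats = three_tone_repetitions(example)
```

With these assumptions, a real song four times as long as a generated sample has its repetition count divided by four before the two are compared.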