Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models

Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, Alán Aspuru-Guzik

TL;DR

The paper tackles controlled unsupervised sequence generation by combining GANs with reinforcement learning to bias samples toward domain-specific metrics while preserving resemblance to the data distribution. It introduces ORGAN, in which the generator optimizes a reward that blends the discriminator's signal with objective-based scores, with a penalty on repeated sequences to encourage diversity and an optional Wasserstein GAN variant for training stability. Experiments on SMILES molecules and MIDI-based melodies show that ORGAN can improve target properties while maintaining diversity, outperforming MLE, SeqGAN, and naive RL in many settings. The results suggest that ORGAN offers a practical, black-box approach to guided sequence generation, and the authors point to extensions to non-sequential data.

Abstract

In unsupervised data generation tasks, besides the generation of a sample based on previous observations, one would often like to give hints to the model in order to bias the generation towards desirable metrics. We propose a method that combines Generative Adversarial Networks (GANs) and reinforcement learning (RL) in order to accomplish exactly that. While RL biases the data generation process towards arbitrary metrics, the GAN component of the reward function ensures that the model still remembers information learned from data. We build upon previous results that incorporated GANs and RL in order to generate sequence data and test this model in several settings for the generation of molecules encoded as text sequences (SMILES) and in the context of music generation, showing for each case that we can effectively bias the generation process towards desired metrics.

Paper Structure

This paper contains 9 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Schema for ORGAN. Left: $D$ is trained as a classifier receiving as input a mix of real data and generated data by $G$. Right: $G$ is trained by RL where the reward is a combination of $D$ and the objectives, and is passed back to the policy function via Monte Carlo sampling. We penalize non-unique sequences.
  • Figure 2: Violin plots of Druglikeliness for molecules from the baseline dataset (n = 5000) and from the optimized OR(W)GAN (n = 5440).
  • Figure 3: Plots of each objective across the training epochs. Objectives were trained for one epoch, and then switched for another.
  • Figure 4: Plots of Diversity and Tonality rewards (the latter re-scaled to the [0, 1] interval) after 80 epochs of training on the music generation task. The upper plot employs the classical GAN loss, while the lower displays a WGAN. The values have been averaged over 1000 samples.
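
The reward described in the Figure 1 caption (a combination of $D$ and the objectives, with a penalty on non-unique sequences) can be illustrated with a minimal Python sketch. The details here are assumptions made for illustration, not taken from the text above: the blend is written as a linear interpolation with a hypothetical weight `lam`, the penalty divides each reward by the number of times the sequence occurs in the batch, and `blended_rewards`, `discriminator`, and `objective` are placeholder names for a black-box scoring setup.

```python
from collections import Counter
from typing import Callable, List, Sequence


def blended_rewards(
    samples: Sequence[str],
    discriminator: Callable[[str], float],
    objective: Callable[[str], float],
    lam: float = 0.5,
) -> List[float]:
    """Score a batch of generated sequences.

    Assumed form (illustrative only):
        reward = (lam * D(s) + (1 - lam) * O(s)) / count(s)
    where count(s) is how often s appears in the batch, so that
    repeated sequences earn a smaller reward each.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    counts = Counter(samples)  # occurrences of each sequence in the batch
    rewards = []
    for s in samples:
        # Blend the discriminator signal with the domain objective.
        base = lam * discriminator(s) + (1.0 - lam) * objective(s)
        # Penalize non-unique sequences by sharing the reward among repeats.
        rewards.append(base / counts[s])
    return rewards


if __name__ == "__main__":
    # Toy stand-ins: a constant "discriminator" and a length-based "objective".
    batch = ["CCO", "CCO", "C1CCCCC1"]
    scores = blended_rewards(
        batch,
        discriminator=lambda s: 0.8,
        objective=lambda s: len(s) / 10,
        lam=0.5,
    )
    for seq, score in zip(batch, scores):
        print(seq, round(score, 3))
```

Under these assumptions, setting `lam` to 1 leaves only the discriminator signal and setting it to 0 leaves only the objective, which is one way to read the comparison against purely adversarial and purely objective-driven baselines.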