Synthesizing Programs for Images using Reinforced Adversarial Learning

Yaroslav Ganin, Tejas Kulkarni, Igor Babuschkin, S. M. Ali Eslami, Oriol Vinyals

TL;DR

The paper tackles inverse graphics by learning, without supervision, a policy that synthesizes controllable visual programs which a renderer executes to match real images. It introduces SPIRAL, an adversarial RL framework that uses a Wasserstein discriminator and a distributed actor-learner architecture to train a generator through a non-differentiable renderer. The results demonstrate unsupervised end-to-end inverse graphics on MNIST, Omniglot, CelebA, and MuJoCo scenes, yielding interpretable stroke-based decompositions and scene descriptions, and show that discriminator-based rewards outperform a simple L2 distance. The work suggests a scalable path for visual program synthesis and inverse simulation, with MCTS and joint image–action discriminators named as promising extensions for richer feedback.

Abstract

Advances in deep generative networks have led to impressive results in recent years. Nevertheless, such models can often waste their capacity on the minutiae of datasets, presumably due to weak inductive biases in their decoders. This is where graphics engines may come in handy since they abstract away low-level details and represent images as high-level programs. Current methods that combine deep learning and renderers are limited by hand-crafted likelihood or distance functions, a need for large amounts of supervision, or difficulties in scaling their inference algorithms to richer datasets. To mitigate these issues, we present SPIRAL, an adversarially trained agent that generates a program which is executed by a graphics engine to interpret and sample images. The goal of this agent is to fool a discriminator network that distinguishes between real and rendered data, trained with a distributed reinforcement learning setup without any supervision. A surprising finding is that using the discriminator's output as a reward signal is the key to allowing the agent to make meaningful progress at matching the desired output rendering. To the best of our knowledge, this is the first demonstration of an end-to-end, unsupervised and adversarial inverse graphics agent on challenging real-world (MNIST, Omniglot, CelebA) and synthetic 3D datasets.
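The loop the abstract describes (an agent emits program fragments, a renderer executes them, and a discriminator score serves as the reward) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in: `render` is a toy one-pixel-per-action renderer rather than libmypaint or MuJoCo, and `toy_discriminator` is a hand-written scoring function, whereas the paper trains a Wasserstein discriminator. The only property taken from the source is that intermediate renders are visible to the agent and the reward arrives at the final step only.

```python
import random

def render(program, size=8):
    """Hypothetical stand-in renderer: each action (x, y) paints one pixel."""
    canvas = [[0.0] * size for _ in range(size)]
    for x, y in program:
        canvas[y][x] = 1.0
    return canvas

def toy_discriminator(canvas):
    """Hypothetical hand-written critic (NOT learned): rewards ink on the
    main diagonal and penalizes ink anywhere else."""
    size = len(canvas)
    on_diag = sum(canvas[i][i] for i in range(size))
    off_diag = sum(sum(row) for row in canvas) - on_diag
    return on_diag - off_diag

def run_episode(policy, num_steps, size=8, rng=None):
    """Roll out one execution trace; only the last step carries a reward."""
    rng = rng or random.Random(0)
    program, rewards = [], []
    canvas = render(program, size)
    for t in range(num_steps):
        action = policy(canvas, rng)      # policy sees the current render
        program.append(action)
        canvas = render(program, size)    # intermediate render
        is_last = t == num_steps - 1
        rewards.append(toy_discriminator(canvas) if is_last else 0.0)
    return program, canvas, rewards

def random_policy(canvas, rng):
    """Baseline: paint a uniformly random pixel."""
    size = len(canvas)
    return rng.randrange(size), rng.randrange(size)

def diagonal_policy(canvas, rng):
    """Uses the intermediate render to pick the next unpainted diagonal pixel."""
    for i in range(len(canvas)):
        if canvas[i][i] == 0.0:
            return i, i
    return 0, 0

if __name__ == "__main__":
    for name, policy in [("random", random_policy), ("diagonal", diagonal_policy)]:
        _, _, rewards = run_episode(policy, num_steps=5)
        print(name, rewards)
```

Because the renderer is treated as a black box, nothing is differentiated through it; a policy-gradient learner would consume the `(program, rewards)` traces instead, which is the role the source assigns to reinforcement learning.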

Paper Structure

This paper contains 18 sections, 12 equations, 13 figures.

Figures (13)

  • Figure 1: SPIRAL takes as input either random noise or images and iteratively produces plausible samples or reconstructions via graphics program synthesis. The first row depicts an unconditional run given random noise. The second, third and fourth rows depict conditional execution given an image with a handwritten character, the Mona Lisa, and objects arranged in a 3D scene.
  • Figure 2: The SPIRAL architecture. (a) An execution trace of the SPIRAL agent. The policy outputs program fragments which are rendered into an image at each step via a graphics engine $\mathcal{R}$. The agent can make use of these intermediate renders to adjust its policy. The agent only receives a reward in the final step of execution. (b) Distributed training of SPIRAL. A collection of actors (in our experiments, up to 64) asynchronously and continuously produce execution traces. This data, along with a training dataset of ground-truth renderings, is passed to a Wasserstein discriminator on a separate GPU for adversarial training. The discriminator assesses the similarity of the final renderings of the traces to the ground truth. A separate off-policy GPU learner receives batches of execution traces and trains the agent's parameters via policy gradients to maximize the reward assigned to them by the discriminator, i.e., to match the distribution of the ground-truth dataset.
  • Figure 3: Illustration of the agent's action space in the libmypaint environment. We show three different strokes (red, green, blue) that can result from a single instruction from the agent to the renderer. Starting from a position on the canvas, the agent selects the coordinates of the next end point, the coordinates of the intermediate control point, as well as the brush size, pressure and color. See Section \ref{sect:environments} for details.
  • Figure 4: MNIST. (a) A SPIRAL agent is trained to draw MNIST digits via a sequence of strokes in the libmypaint environment. As training progresses, the quality of the generations increases. The final samples capture the multi-modality of the dataset, varying brush sizes and digit styles. (b) A conditional SPIRAL agent is trained to reconstruct using the same action space. Reconstructions (left) match ground-truth (right) accurately.
  • Figure 5: Omniglot. (a) A SPIRAL agent is trained to draw Omniglot characters via a sequence of strokes in the libmypaint environment. As training progresses, the quality of the generations increases. The final samples capture the multi-modality of the dataset, varying brush sizes and character styles. (b) A conditional SPIRAL agent is trained to reconstruct using the same action space. Reconstructions (left) match ground-truth (right) accurately.
  • ...and 8 more figures
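
The stroke action described in the Figure 3 caption (an end point, an intermediate control point, brush size, pressure and color, applied from the current canvas position) can be sketched as a small data structure. This is an illustration under assumptions, not the paper's implementation: the `StrokeAction` type and `bezier_points` helper are hypothetical names, and interpreting the start, control and end points as a quadratic Bézier curve is an assumption based on the caption's "intermediate control point". Brush size, pressure and color are carried along but not rendered here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass(frozen=True)
class StrokeAction:
    """Hypothetical container for one instruction to the renderer."""
    end: Point                          # coordinates of the next end point
    control: Point                      # intermediate control point
    brush_size: float
    pressure: float
    color: Tuple[float, float, float]   # RGB in [0, 1] (assumed convention)

def bezier_points(start: Point, action: StrokeAction,
                  num_samples: int = 5) -> List[Point]:
    """Sample a quadratic Bezier curve from `start` to `action.end`.

    B(t) = (1 - t)^2 * start + 2 (1 - t) t * control + t^2 * end
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    points = []
    for i in range(num_samples):
        t = i / (num_samples - 1)
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        x = a * start[0] + b * action.control[0] + c * action.end[0]
        y = a * start[1] + b * action.control[1] + c * action.end[1]
        points.append((x, y))
    return points

if __name__ == "__main__":
    action = StrokeAction(end=(2.0, 0.0), control=(1.0, 2.0),
                          brush_size=1.0, pressure=0.5, color=(1.0, 0.0, 0.0))
    for p in bezier_points((0.0, 0.0), action, num_samples=3):
        print(p)
```

Under this reading, the end point of one stroke becomes the start position of the next, so a drawing is a chain of such actions.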