Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine

TL;DR

The paper introduces the variational discriminator bottleneck (VDB), a mutual-information-based regularizer that constrains information flow through the discriminator to stabilize adversarial learning. An encoder is inserted in front of the discriminator, and the constraint $I(X;Z) \leq I_c$ is enforced with an adaptively updated multiplier $\beta$, which keeps the discriminator's gradients informative across GANs, imitation learning (VAIL), and inverse RL (VAIRL). The authors report substantial improvements in motion imitation from raw video, transferable reward learning, and image generation stability, outperforming or matching prior methods. The approach offers a unified, adaptive mechanism for regularizing discriminators in adversarial learning.
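
To make the constraint concrete, the following is a sketch of how a bound of this kind is commonly relaxed into a trainable objective. The symbols $p^*(\mathbf{x})$, $G(\mathbf{x})$, $E(\mathbf{z}|\mathbf{x})$, and $D(\mathbf{z})$ are the ones used elsewhere on this page. The prior $r(\mathbf{z})$, the mixture $\tilde{p} = \frac{1}{2}p^* + \frac{1}{2}G$, and the step size $\alpha_\beta$ are notation assumed here for illustration and are not stated in this summary. The KL term acts as a variational upper bound on $I(X;Z)$, so keeping it below $I_c$ also keeps the mutual information below $I_c$.

```latex
\begin{aligned}
\min_{D,\,E}\ \max_{\beta \ge 0}\quad
  & \mathbb{E}_{\mathbf{x} \sim p^*(\mathbf{x})}
      \Big[ \mathbb{E}_{\mathbf{z} \sim E(\mathbf{z}|\mathbf{x})}
        \big[ -\log D(\mathbf{z}) \big] \Big]
    + \mathbb{E}_{\mathbf{x} \sim G(\mathbf{x})}
      \Big[ \mathbb{E}_{\mathbf{z} \sim E(\mathbf{z}|\mathbf{x})}
        \big[ -\log \big( 1 - D(\mathbf{z}) \big) \big] \Big] \\
  & + \beta \Big( \mathbb{E}_{\mathbf{x} \sim \tilde{p}(\mathbf{x})}
      \big[ \mathrm{KL}\big[ E(\mathbf{z}|\mathbf{x}) \,\|\, r(\mathbf{z}) \big] \big]
      - I_c \Big) \\[4pt]
\beta \leftarrow\
  & \max\Big( 0,\ \beta + \alpha_\beta \Big( \mathbb{E}_{\mathbf{x} \sim \tilde{p}(\mathbf{x})}
      \big[ \mathrm{KL}\big[ E(\mathbf{z}|\mathbf{x}) \,\|\, r(\mathbf{z}) \big] \big]
      - I_c \Big) \Big)
\end{aligned}
```

Under this reading, $\beta$ grows while the KL estimate exceeds $I_c$ and shrinks toward zero once the constraint is satisfied, which is what makes the regularization adaptive rather than fixed.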

Abstract

Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator's internal representation, we can effectively modulate the discriminator's accuracy and maintain useful and informative gradients. We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from raw video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods.
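
The bottleneck described in the abstract can be illustrated numerically. The NumPy sketch below is not the authors' implementation: it assumes a diagonal-Gaussian encoder, a standard normal prior, and a projected dual-ascent step for the multiplier. All function names (`kl_to_std_normal`, `vdb_discriminator_loss`, `update_beta`) and the step size are hypothetical, chosen only for this example.

```python
import numpy as np


def kl_to_std_normal(mu, log_var):
    """Per-sample KL[N(mu, diag(exp(log_var))) || N(0, I)], shape (batch,).

    Assumption: the encoder outputs a diagonal Gaussian and the prior is N(0, I).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)


def vdb_discriminator_loss(logits_real, logits_fake, kl_real, kl_fake, beta, i_c):
    """Cross-entropy discriminator loss plus the weighted bottleneck penalty.

    Returns (loss, mean_kl). logits_* are pre-sigmoid discriminator outputs
    computed on latent samples z; kl_* are per-sample KL values.
    """
    # -log(sigmoid(l)) == softplus(-l) and -log(1 - sigmoid(l)) == softplus(l)
    ce = np.mean(np.logaddexp(0.0, -logits_real)) + np.mean(np.logaddexp(0.0, logits_fake))
    # KL averaged over an equal mixture of real and generated samples
    mean_kl = 0.5 * (np.mean(kl_real) + np.mean(kl_fake))
    return ce + beta * (mean_kl - i_c), mean_kl


def update_beta(beta, mean_kl, i_c, step_size):
    """Projected dual ascent: raise beta when mean_kl exceeds i_c, never below 0."""
    return max(0.0, beta + step_size * (mean_kl - i_c))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical encoder outputs for 8 real and 8 generated samples, 4 latent dims
    mu_r, lv_r = rng.normal(size=(8, 4)), rng.normal(scale=0.1, size=(8, 4))
    mu_f, lv_f = rng.normal(size=(8, 4)), rng.normal(scale=0.1, size=(8, 4))
    loss, mean_kl = vdb_discriminator_loss(
        rng.normal(size=8), rng.normal(size=8),
        kl_to_std_normal(mu_r, lv_r), kl_to_std_normal(mu_f, lv_f),
        beta=0.1, i_c=0.5,
    )
    print("loss:", loss, "mean KL:", mean_kl)
    print("new beta:", update_beta(0.1, mean_kl, i_c=0.5, step_size=1e-3))
```

A tighter bound `i_c` pushes the multiplier up more often, which limits how sharply the discriminator can separate real from generated samples. In a full training loop, the loss would be minimized with respect to the encoder and discriminator parameters by an autodiff framework, with `update_beta` called after each step.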

Paper Structure

This paper contains 27 sections, 1 theorem, 37 equations, 18 figures, 4 tables.

Key Result

Theorem A.1

Let $g({\mathbf{u}})$ denote the generator's mapping from a noise vector ${\mathbf{u}} \sim p({\mathbf{u}})$ to a point in $X$. Given the generator distribution $G(\mathbf{x})$ and data distribution $p^*(\mathbf{x})$, a VDB with an encoder $E({\mathbf{z}} | {\mathbf{x}}) = \mathcal{N}(\mu_E(\ldots$ [...] where $D^*({\mathbf{z}})$ is the optimal discriminator, $a(\mathbf{x})$ and $b(\mathbf{x})$ are pos[...]

Figures (18)

  • Figure 1: Our method is general and can be applied to a broad range of adversarial learning tasks. Left: Motion imitation with adversarial imitation learning. Middle: Image generation. Right: Learning transferable reward functions through adversarial inverse reinforcement learning.
  • Figure 2: Left: Overview of the variational discriminator bottleneck. The encoder first maps samples $\mathbf{x}$ to a latent distribution $E(\mathbf{z}|\mathbf{x})$. The discriminator is then trained to classify samples $\mathbf{z}$ from the latent distribution. An information bottleneck $I(X;Z) \leq I_c$ is applied to $Z$. Right: Visualization of discriminators trained to differentiate two Gaussians with different KL bounds $I_c$.
  • Figure 3: Simulated humanoid performing various skills. VAIL is able to closely imitate a broad range of skills from mocap data.
  • Figure 4: Learning curves comparing VAIL to other methods for motion imitation. Performance is measured using the average joint rotation error between the simulated character and the reference motion. Each method is evaluated with 3 random seeds.
  • Figure 5: Left: Snapshots of the video demonstration and the simulated character trained with VAIL. The policy learns to run by directly imitating the video. Right: Saliency maps that visualize the magnitude of the discriminator's gradient with respect to all channels of the RGB input images from both the demonstration and the simulation. Pixel values are normalized between $[0, 1]$.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Theorem A.1
  • proof
  • proof