Distributional Reinforcement Learning for Energy-Based Sequential Models

Tetiana Parshakova, Jean-Marc Andreoli, Marc Dymetman

TL;DR

The paper tackles how to extract efficient autoregressive samplers from energy-based sequential models, specifically Global Autoregressive Models (GAMs), which couple a local autoregressive model (AM) with a global potential. It reframes the second training stage (Training-2) as learning an autoregressive policy via Distributional Policy Gradient (DPG), a general method that does not require sampling from the energy-based distribution. In synthetic GAM experiments, DPG_off reduces perplexity about as data-efficiently as distillation and closely approximates the underlying energy-based distribution, even under varying feature sets. The two-stage GAM training clarifies why learning the energy representation can be easier than deriving a sampler, and the distributional RL perspective opens avenues for further improvements, such as actor-critic approaches for sampling from EBMs.

Abstract

Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based Model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue, [Parshakova et al., CoNLL 2019] proposes a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but from a distributional rather than an optimization perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.
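
To make the distributional view concrete, here is a minimal sketch rather than the authors' implementation. It relies on the standard importance-sampling identity $\nabla_\theta CE(p,\pi_\theta) = -\frac{1}{Z}\,E_{x\sim q}\big[\frac{P(x)}{q(x)}\nabla_\theta \log \pi_\theta(x)\big]$, where $P$ is the unnormalized EBM score, $p = P/Z$, and $q$ is any proposal we can sample from. Everything concrete below is an assumption made for illustration: the toy score $P(x) = \exp(\lambda \cdot \#\text{ones}(x))$ over binary sequences, the fixed uniform proposal $q$, the policy with one Bernoulli logit per position (a degenerate autoregressive model that ignores the prefix), and the names `log_P` and `dpg_fit`. The unknown constant $Z$ is simply absorbed into the learning rate.

```python
import numpy as np

def log_P(x, lam):
    """Log of a toy unnormalized score P(x) = exp(lam * number of ones in x).
    Illustrative assumption: this stands in for a real sequential EBM."""
    return lam * x.sum(axis=-1)

def dpg_fit(n=4, lam=1.0, steps=1500, batch=512, lr=0.002, seed=0):
    """Fit pi_theta to p = P / Z using importance-weighted policy-gradient steps.
    Samples come from a fixed uniform proposal q, never from p itself."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n)              # one Bernoulli logit per position (toy policy)
    log_q = -n * np.log(2.0)         # uniform proposal: q(x) = 2^-n
    for _ in range(steps):
        x = rng.integers(0, 2, size=(batch, n)).astype(float)  # x ~ q
        w = np.exp(log_P(x, lam) - log_q)                      # weights P(x) / q(x)
        probs = 1.0 / (1.0 + np.exp(-theta))                   # pi_theta(x_i = 1)
        grad_log_pi = x - probs                                # d/dtheta log pi_theta(x)
        # Ascent step on the weighted log-likelihood; Z only rescales the step size.
        theta += lr * (w[:, None] * grad_log_pi).mean(axis=0)
    return theta

theta = dpg_fit()
print(np.round(theta, 2))
```

In this toy setting the normalized target factorizes into independent bits with $p(x_i = 1) = \sigma(\lambda)$, so the learned logits should settle near $\lambda$. That gives a simple sanity check that the weighted updates are moving $\pi_\theta$ toward $p$ rather than toward a high-reward mode.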

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Two-stage training. At the end of the process, we compare the perplexities of $r$ and $\pi_\theta$ on test data: $CE(T,r)$ vs. $CE(T,\pi_\theta)$.
  • Figure 2: Distillation vs. DPG
  • Figure 3: snis vs. rs for Training-1. In Training-2, only distillation was used.
  • Figure 4: DPG vs. $p$
  • Figure 5: DPG vs. Distillation with length feature on (top) or off (bottom).
  • ...and 2 more figures