Distributional Reinforcement Learning for Energy-Based Sequential Models
Tetiana Parshakova, Jean-Marc Andreoli, Marc Dymetman
TL;DR
The paper tackles how to extract efficient autoregressive samplers from Global Autoregressive Models (GAMs), energy-based sequential models that couple a local autoregressive model (AM) with a global potential. It reframes the second training stage (Training-2), in which a sampler is derived from the trained energy-based model, as learning a distributional autoregressive policy via Distributional Policy Gradient (DPG), a general method that does not require sampling from the energy-based distribution. In synthetic GAM experiments, the off-policy variant DPG_off achieves data-efficient perplexity reduction comparable to distillation and closely approximates the underlying energy-based distribution across different feature sets. The two-stage view of GAM training clarifies why learning the energy-based representation can be easier than deriving a sampler from it, and the distributional RL perspective opens avenues for further improvements, such as actor-critic methods for sampling from EBMs.
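For reference, here is a sketch of the gradient identity that makes DPG independent of sampling from the EBM. It assumes the GAM form P(x) = r(x) exp(⟨λ, φ(x)⟩) as we understand it from Parshakova et al. (2019), with r the autoregressive model, φ the global features, λ their weights, and q any proposal with q(x) > 0 wherever P(x) > 0. Only pointwise evaluations of the unnormalized P are needed, never samples from p.

```latex
p(x) = \frac{P(x)}{Z}, \qquad Z = \sum_x P(x), \qquad P(x) = r(x)\, e^{\langle \lambda, \phi(x) \rangle}

\mathrm{CE}(p, \pi_\theta) = -\sum_x p(x) \log \pi_\theta(x)

\nabla_\theta \mathrm{CE}(p, \pi_\theta)
  = -\sum_x p(x)\, \nabla_\theta \log \pi_\theta(x)
  = -\frac{1}{Z}\, \mathbb{E}_{x \sim q}\!\left[ \frac{P(x)}{q(x)}\, \nabla_\theta \log \pi_\theta(x) \right]

\theta \leftarrow \theta + \alpha\, \frac{P(x)}{q(x)}\, \nabla_\theta \log \pi_\theta(x), \qquad x \sim q
```

The unknown constant 1/Z only rescales the step and is absorbed into the learning rate α. Taking q = π_θ gives the on-policy form of DPG, while sampling from a separate proposal q gives the off-policy DPG_off.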
Abstract
Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based Model (EBM) over sequences is derived. This EBM has high representational power, but it is unnormalized and cannot be directly exploited for sampling. To address this issue, [Parshakova et al., CoNLL 2019] proposes a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but from a distributional rather than an optimization perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.
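To make the "general approach applicable to any sequential EBM" concrete, here is a minimal numpy sketch of the importance-weighted DPG update under toy assumptions. A hypothetical position-wise factorized policy stands in for an autoregressive network, and a hypothetical unnormalized target P(x) = exp(Σ_t W[t]·x[t]) over binary sequences stands in for a GAM. Samples come from a proposal q that is a frozen copy of π_θ refreshed every `refresh` steps; this fixed schedule is our simplification, not the paper's rule for updating q in DPG_off. All names (`W`, `dpg_off`, `refresh`, ...) are illustrative.

```python
import numpy as np

# Hypothetical toy target: binary sequences x in {0,1}^T with unnormalized
# score P(x) = exp(sum_t W[t] * x[t]). The learner never computes Z.
W = np.array([1.0, -1.0, 0.5, 0.0])
T = W.shape[0]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_P(x):
    """Unnormalized log-score log P(x) for a batch x of shape (B, T)."""
    return x @ W


def sample(theta, batch, rng):
    """Draw x ~ pi_theta, where pi_theta(x_t = 1) = sigmoid(theta_t) independently."""
    return (rng.random((batch, T)) < sigmoid(theta)).astype(float)


def log_pi(theta, x):
    """log pi_theta(x) for a batch x of shape (B, T)."""
    log_p1 = -np.logaddexp(0.0, -theta)  # log sigmoid(theta)
    log_p0 = -np.logaddexp(0.0, theta)   # log sigmoid(-theta)
    return np.sum(x * log_p1 + (1.0 - x) * log_p0, axis=1)


def dpg_off(steps=1000, batch=1000, lr=0.005, refresh=10, seed=0):
    """Importance-weighted DPG: theta += lr * mean(P(x)/q(x) * grad log pi_theta(x))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(T)
    q_theta = theta.copy()
    for step in range(steps):
        if step % refresh == 0:
            q_theta = theta.copy()  # proposal q: frozen snapshot of pi_theta
        x = sample(q_theta, batch, rng)
        # Importance weights P(x) / q(x); the unknown 1/Z is absorbed into lr.
        weights = np.exp(log_P(x) - log_pi(q_theta, x))
        # Gradient of log pi_theta(x) w.r.t. theta for the factorized policy.
        grad_log = x - sigmoid(theta)
        theta = theta + lr * np.mean(weights[:, None] * grad_log, axis=0)
    return theta


theta = dpg_off()
print("learned theta:", np.round(theta, 3), "target W:", W)
```

Because this toy target factorizes over positions, the cross-entropy minimizer is exactly θ = W, which gives an easy check on the estimator. In an actual GAM, π_θ would be an autoregressive network and log P(x) would be log r(x) + ⟨λ, φ(x)⟩.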
