Principled Gradient-based Markov Chain Monte Carlo for Text Generation

Li Du, Afra Amini, Lucas Torroba Hennigen, Xinyan Velocity Yu, Jason Eisner, Holden Lee, Ryan Cotterell

TL;DR

This paper proposes several gradient-based sampling algorithms for energy-based text generation that are faithful, meaning that they have the target text distribution as their limiting distribution.

Abstract

Recent papers have demonstrated the possibility of energy-based text generation by adapting gradient-based sampling algorithms, a paradigm of MCMC algorithms that promises fast convergence. However, as we show in this paper, previous attempts at this approach to text generation all fail to sample correctly from the target language model distributions. To address this limitation, we consider the problem of designing text samplers that are faithful, meaning that they have the target text distribution as their limiting distribution. We propose several faithful gradient-based sampling algorithms that sample correctly from the target energy-based text distribution, and study their theoretical properties. Through experiments on various forms of text generation, we demonstrate that faithful samplers generate more fluent text while adhering better to the control objectives.
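To make the notion of a faithful sampler concrete, the following is a minimal sketch on a hypothetical four-state target, not the paper's toy language model or any of its algorithms. It assumes an arbitrary hand-picked proposal matrix `Q` and uses the standard Metropolis-Hastings correction as a stand-in for "a sampler whose limiting distribution is the target". The energies, `Q`, and all function names are illustrative assumptions.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a row-stochastic matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    # The stationary distribution is the left eigenvector for eigenvalue 1.
    v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    return v / v.sum()

def total_variation(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * np.abs(p - q).sum()

def metropolis_hastings_kernel(Q, pi):
    """Transition matrix obtained by adding an accept/reject step to proposal Q."""
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                accept = min(1.0, (pi[j] * Q[j, i]) / (pi[i] * Q[i, j]))
                P[i, j] = Q[i, j] * accept
        # Rejected proposals keep the chain in state i.
        P[i, i] = 1.0 - P[i].sum()
    return P

# Hypothetical energies; the target is pi(x) proportional to exp(-energy(x)).
energy = np.array([0.0, 1.0, 2.0, 0.5])
pi = np.exp(-energy) / np.exp(-energy).sum()

# Hypothetical asymmetric proposal (each row sums to 1).
Q = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.3, 0.1, 0.5, 0.1],
    [0.2, 0.2, 0.1, 0.5],
    [0.4, 0.1, 0.3, 0.2],
])

P = metropolis_hastings_kernel(Q, pi)
tv_uncorrected = total_variation(stationary(Q), pi)  # chain that always accepts
tv_corrected = total_variation(stationary(P), pi)    # chain with accept/reject
print(f"TV without correction: {tv_uncorrected:.4f}")
print(f"TV with correction:    {tv_corrected:.2e}")
```

In this sketch the uncorrected chain has a limiting distribution that differs from the target, while the corrected chain matches it up to floating-point error, which is the property the abstract calls faithfulness.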

Paper Structure

This paper contains 45 sections, 6 theorems, 61 equations, 3 figures, and 2 tables.

Key Result

theorem 1

Let $\pi(\mathbf{x})$ be a discrete log-quadratic distribution as defined in def:log-quad. For any $\alpha>0$, there exists a unique distribution $\pi_\alpha(\mathbf{x})$ such that the Markov chain defined by $q$ in eq:pncg-final-form is reversible with respect to $\pi_\alpha$. Further, $\pi_\alpha \to \pi$ weakly ...
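For reference, reversibility here is the standard detailed balance condition. The conditional notation $q(\mathbf{x}' \mid \mathbf{x})$ for the transition kernel, with $\mathbf{x}$ and $\mathbf{x}'$ denoting two sequences, is an assumption about how eq:pncg-final-form is written:

$$\pi_\alpha(\mathbf{x})\, q(\mathbf{x}' \mid \mathbf{x}) = \pi_\alpha(\mathbf{x}')\, q(\mathbf{x} \mid \mathbf{x}') \quad \text{for all } \mathbf{x}, \mathbf{x}'.$$

Summing both sides over $\mathbf{x}$ gives $\sum_{\mathbf{x}} \pi_\alpha(\mathbf{x})\, q(\mathbf{x}' \mid \mathbf{x}) = \pi_\alpha(\mathbf{x}')$, so any distribution satisfying detailed balance is also stationary for the chain.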

Figures (3)

  • Figure 1: Total variation distance between $\pi_\mathrm{mcmc}$, the limiting distribution of MCMC algorithms from previous works, and $\pi_\mathrm{toy}$, the toy language model distribution from ex:toy-lm. $\pi_\mathrm{mcmc}$ is computed with spectral decomposition when possible. We observe that the limiting distribution of one previous sampler is far from the target distribution, while that of another, depending on its step size $\alpha$, may be close to the target distribution. Nevertheless, the latter does not have the correct limiting distribution for any $\alpha$.
  • Figure 2: Total variation distance between the empirical distribution of different samplers (at different steps) and $\pi_\mathrm{toy}$, the true distribution of the toy language model from ex:toy-lm.
  • Figure 3: Energy traces of different samplers when sampling from GPT-2. We observe that the faithful samplers (including GwL) converge to an unbiased estimate of the energy (estimated using ancestral sampling). In contrast, the energy of the non-faithful chain drops initially but suffers from systematic bias and is unable to converge to the true energy distribution, similar to the conclusion of our exact analysis in ex:toy-lm.
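The quantity plotted in Figure 2, the total variation distance between a sampler's empirical distribution and the target at different steps, can be sketched as follows. This is an illustrative setup only: the four-state target, the uniform proposal, and the function names are assumptions, and the chain is a plain Metropolis sampler rather than any sampler evaluated in the paper.

```python
import numpy as np

def total_variation(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

def mh_chain_tv_trace(pi, n_steps, checkpoints, seed=0):
    """Run a Metropolis chain and record the TV distance at the given steps."""
    rng = np.random.default_rng(seed)
    n = len(pi)
    counts = np.zeros(n)
    x = 0  # arbitrary initial state
    trace = {}
    for t in range(1, n_steps + 1):
        y = rng.integers(n)  # symmetric uniform proposal over all states
        if rng.random() < min(1.0, pi[y] / pi[x]):
            x = y  # accept the proposal; otherwise stay at x
        counts[x] += 1
        if t in checkpoints:
            trace[t] = total_variation(counts / t, pi)
    return trace

# Hypothetical energies; the target is pi(x) proportional to exp(-energy(x)).
energy = np.array([0.0, 1.0, 2.0, 0.5])
pi = np.exp(-energy) / np.exp(-energy).sum()

trace = mh_chain_tv_trace(pi, n_steps=20000, checkpoints={10, 100, 1000, 20000})
for step, tv in sorted(trace.items()):
    print(f"step {step:>6}: TV = {tv:.4f}")
```

For a faithful chain the recorded distance should shrink toward zero as the number of steps grows, up to Monte Carlo noise. A chain with a biased limiting distribution would instead plateau at a positive value.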

Theorems & Definitions (11)

  • definition 1
  • theorem 1
  • proof: Proof Idea
  • theorem 2
  • proof: Proof Idea
  • theorem 2
  • proof
  • theorem 3: disc theorem; Theorem 6.1.1 in horn2013
  • theorem 3
  • proof
  • ...and 1 more