MAP's not dead yet: Uncovering true language model modes by conditioning away degeneracy

Davis Yoshida, Kartik Goyal, Kevin Gimpel

TL;DR

This work argues that degenerate outputs in MAP decoding can arise even from perfectly trained language models due to data contamination with low-entropy distractors. It introduces attribute-conditional search as a principled way to recover high-quality conditional modes, and demonstrates both exact length-conditioned modes for MT and story completion and approximate conditional beam search (ACBS) to scale to larger models and practical tasks. Exact conditional modes are fluent and topical when length is constrained, while ACBS improves likelihood and human-judged quality over standard beam search, including enabling instruction-following behavior in LLaMA-7B without finetuning. The results suggest MAP-based decoding remains a promising direction for constrained generation, provided one conditions on informative attributes and uses efficient approximate methods. The findings have practical implications for controlled NLG and open avenues for safer, more reliable generation without extensive retraining.

Abstract

It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Holtzman et al., 2019; Stahlberg and Byrne, 2019). Prior work has attributed this behavior to either a fundamental and unavoidable inadequacy of modes in probabilistic models or weaknesses in language modeling. Contrastingly, we argue that degenerate modes can even occur in the absence of any modeling error, due to contamination of the training data. Specifically, we argue that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution's mode to become degenerate. We therefore propose to apply MAP decoding to the model's true conditional distribution where the conditioning variable explicitly avoids specific degenerate behavior. Using exact search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, we observe that various kinds of degenerate modes persist, even at the scale of LLaMA-7B. Although we cannot tractably address these degeneracies with exact search, we perform a classifier-based approximate search on LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
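The contamination argument can be checked with a small numeric sketch. Everything below is an illustrative assumption rather than the authors' setup: a hypothetical population of 10,000 equally likely fluent sentences, a noise distribution that puts all of its mass on the empty string, and a mixing weight of 0.1%. When the two supports are disjoint, the noise sequence becomes the mode as soon as `eps * q_max > (1 - eps) * p_max`, where `p_max` and `q_max` are the largest population and noise probabilities.

```python
from fractions import Fraction


def mix_with_noise(population, noise, eps):
    """Return the mixture (1 - eps) * population + eps * noise as a dict."""
    support = set(population) | set(noise)
    return {
        seq: (1 - eps) * population.get(seq, 0) + eps * noise.get(seq, 0)
        for seq in support
    }


def mode(dist, keep=lambda seq: True):
    """Most probable sequence among those satisfying `keep`.

    Restricting the argmax to an attribute set gives the mode of the
    conditional distribution, because renormalizing does not change ranks.
    """
    candidates = {seq: p for seq, p in dist.items() if keep(seq)}
    return max(candidates, key=candidates.get)


# Hypothetical high-entropy population: 10,000 equally likely sentences.
n_fluent = 10_000
population = {f"fluent sentence #{i}": Fraction(1, n_fluent) for i in range(n_fluent)}

# Hypothetical low-entropy noise: all mass on the empty string.
noise = {"": Fraction(1)}

eps = Fraction(1, 1000)  # 0.1% contamination (assumed value)
data = mix_with_noise(population, noise, eps)

unconditional_mode = mode(data)
conditional_mode = mode(data, keep=lambda seq: len(seq) > 0)

print(repr(unconditional_mode), float(data[unconditional_mode]))
print(repr(conditional_mode), float(data[conditional_mode]))
```

In this toy setting the empty string holds only 0.1% of the mass, yet it outranks every individual fluent sentence, each of which holds 0.00999%. Conditioning on a non-empty output removes the degenerate candidate without changing the model.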

Paper Structure (51 sections, 5 equations, 5 figures, 12 tables)

Figures (5)

  • Figure 1: Geometric mean of the probabilities MarianMT Zh-En assigns to the empty sequence as a translation of inputs of varying lengths.
  • Figure 2: Rate at which the exact mode of the output distribution for (a) MarianMT Zh-En and (b) LLaMA-7B variants is empty.
  • Figure 3: Finetuned GPT-2-345M predictions of empty outputs on the ROC Stories validation set (1571 stories). Stories are grouped into 5 equally sized bins by reference continuation length.
  • Figure 4: Re-centered log probability LLaMA-7B variants assign to the empty sequence, averaged over 1000 inputs from the databricks-dolly-15k dataset. Each curve has been shifted to have a mean value of 0.
  • Figure 5: A comparison of classifier-guided conditional beam search and beam search when generating outputs that must be a certain length, using our finetuned GPT-2-345M model on the ROC Stories dev set. See Table \ref{tab:mt_winrate} for interpretation.
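
Two of the captions above name summary statistics: a geometric mean of probabilities (Figure 1) and a re-centered log-probability curve (Figure 4). The sketch below shows one plausible way to compute such quantities from per-input log probabilities. The function names and the sample values are hypothetical, and the paper's exact procedure may differ.

```python
import math


def geometric_mean_prob(log_probs):
    """Geometric mean of probabilities, computed as exp(mean of log-probs).

    Working in log space avoids underflow when probabilities are tiny.
    """
    return math.exp(sum(log_probs) / len(log_probs))


def recenter(curve):
    """Shift a curve of values so that its mean is 0."""
    mean = sum(curve) / len(curve)
    return [value - mean for value in curve]


# Hypothetical log-probabilities assigned to the empty sequence for two inputs.
empty_log_probs = [math.log(0.01), math.log(0.04)]
print(geometric_mean_prob(empty_log_probs))

# Hypothetical averaged log-probability curve for one model variant.
print(recenter([-3.0, -5.0, -7.0]))
```

Re-centering discards each curve's absolute level, so curves from different model variants can be compared by shape alone.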