A Frustratingly Simple Decoding Method for Neural Text Generation

Haoran Yang, Deng Cai, Huayang Li, Wei Bi, Wai Lam, Shuming Shi

TL;DR

Frustratingly Simple Decoding (FSD) introduces an on-the-fly anti-LM that penalizes repetitive content during neural text generation. By combining the standard LM score $p_{\theta}(v|x_{<t})$ with a penalty term $p_{\omega}(v|x_{<t})$ via $\text{FSD}(v|x_{<t}) = p_{\theta}(v|x_{<t}) - \alpha \times p_{\omega}(v|x_{<t})$, FSD can be instantiated with a discrete $n$-gram anti-LM or a vectorized variant that uses hidden states, enabling GPU acceleration. The method requires no extra model parameters and achieves near-greedy decoding speeds while improving generation quality, as shown by automatic and human evaluations across multiple datasets, languages, and tasks, including instruction following and summarization. Overall, FSD offers a universal, efficient decoding paradigm that mitigates degeneration in open-ended generation and demonstrates robust performance across LM families and domains.

Abstract

We introduce a frustratingly simple, super efficient, and surprisingly effective decoding method, which we call Frustratingly Simple Decoding (FSD), for neural text generation. The idea behind FSD is straightforward: we build an anti-LM from the previously generated text and use this anti-LM to penalize the future generation of what has already been generated. The anti-LM can be implemented as simply as an n-gram language model or a vectorized variant. In this way, FSD introduces no extra model parameters and negligible computational overhead (FSD can be as fast as greedy search). Despite its simplicity, FSD is surprisingly effective: experiments show that FSD can outperform the current canonical method (i.e., nucleus sampling) as well as several strong, recently proposed baselines.
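To make the scoring rule concrete, here is a minimal sketch of a single decoding step using the score $p_{\theta}(v|x_{<t}) - \alpha \times p_{\omega}(v|x_{<t})$ from the TL;DR, with a discrete $n$-gram anti-LM built from the text so far. The function names `anti_lm_probs` and `fsd_step`, the unsmoothed count-based estimate, the toy probabilities, and the values of `n` and `alpha` are all assumptions made for illustration; the paper's actual anti-LM (for example, any smoothing or candidate filtering) may differ.

```python
from collections import Counter


def anti_lm_probs(tokens, n):
    """Hypothetical unsmoothed n-gram anti-LM: estimates p_omega(v | last n-1 tokens)
    from counts over `tokens` only (the prefix plus everything generated so far)."""
    if len(tokens) < n:
        return {}  # not enough history to form a single n-gram
    context = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        # Count which token followed each earlier occurrence of the current context.
        if tuple(tokens[i:i + n - 1]) == context:
            counts[tokens[i + n - 1]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else {}


def fsd_step(lm_probs, tokens, n=3, alpha=0.5):
    """Score each candidate as p_theta - alpha * p_omega and pick the argmax."""
    penalty = anti_lm_probs(tokens, n)
    scores = {v: p - alpha * penalty.get(v, 0.0) for v, p in lm_probs.items()}
    return max(scores, key=scores.get), scores


# Toy example in the spirit of Figure 1; the numbers are invented, not from the paper.
tokens = ["a", "man", "is", "driving", "a", "car", ".", "the", "man", "is"]
lm_probs = {"driving": 0.5, "wearing": 0.3, "walking": 0.2}  # stand-in for p_theta

choice, scores = fsd_step(lm_probs, tokens, n=3, alpha=0.5)
print(choice)  # "wearing": "driving" is penalized because "man is driving" already occurred
```

With `alpha=0` the penalty vanishes and the step reduces to greedy search, which is one way to see why the overhead over greedy decoding can be small: the only extra work is the lookup in the anti-LM.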

Paper Structure

This paper contains 45 sections, 5 equations, 4 figures, 21 tables, and 2 algorithms.

Figures (4)

  • Figure 1: FSD exploits the contrasts between the LM and the anti-LM, where the probabilities from the LM and the anti-LM are used as rewards and penalties respectively. In the above example, the top prediction of the LM is "driving". However, the anti-LM also gives a large penalty to "driving" because it will result in repetition. Consequently, "wearing" is instead selected and the anti-LM is updated accordingly.
  • Figure 2: Diversity across different generation lengths.
  • Figure 3: Decoding latency tested on GPT2-XL.
  • Figure 4: Repetition rate for different $n$.