Improving Next Tokens via Second-to-Last Predictions with Generate and Refine

Johannes Schneider

TL;DR

The paper aims to improve next-token predictions by leveraging bidirectional context without full encoder–decoder training. It introduces a decoder-only second-to-last token predictor $f_s$ and couples it with a standard autoregressive predictor $f_n$ through a generate-then-refine algorithm: the top-$k$ next-token candidates are reweighted by a weight $w$ when $f_s$ correctly identifies the second-to-last token. Empirically, second-to-last token predictions are more than $15\%$ more accurate than next-token predictions, and the refinement yields smaller but statistically significant gains across GPT-2 variants and multiple datasets. Training efficiency is improved by a structured, deterministic masking strategy using subsequences of length $l=4$. The work demonstrates a practical self-correction mechanism for generation and outlines avenues for improvement by tuning $w$, $k$, and token-position strategies, which could make language generation more reliable.

Abstract

Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second-to-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured deterministic approach to masking tokens. We use our model to improve the next-token predictions of a standard GPT by combining both predictions in a "generate-then-refine" approach. We demonstrate on different variants of GPT-2 and different datasets that (not unexpectedly) second-to-last token predictions are much more accurate, i.e., more than 15% higher accuracy than standard next-token predictions. The "generate-then-refine" approach also demonstrates notable improvements in next-token predictions, yielding smaller yet consistent and significant gains.
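
The generate-then-refine idea can be illustrated with a small sketch. This is not the paper's Algorithm 1 but a toy reading of the summary above, under stated assumptions: the function name `generate_then_refine`, the callable interfaces for $f_n$ and $f_s$, the toy models, and the multiplicative use of $w$ are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Sequence

Token = str
# Assumed interface: f_n maps a prefix to next-token probabilities.
NextFn = Callable[[Sequence[Token]], Dict[Token, float]]
# Assumed interface: f_s sees the prefix without its last token plus a
# candidate final token, and predicts the (hidden) second-to-last token.
SecondLastFn = Callable[[Sequence[Token], Token], Token]


def generate_then_refine(prefix: Sequence[Token], f_n: NextFn,
                         f_s: SecondLastFn, k: int = 3,
                         w: float = 1.5) -> Token:
    """Pick a next token from the top-k of f_n, boosting candidates for
    which f_s recovers the true last token of the prefix."""
    probs = f_n(prefix)
    # Generate: keep the k most probable next-token candidates.
    top_k = sorted(probs, key=lambda t: probs[t], reverse=True)[:k]
    context, target = prefix[:-1], prefix[-1]
    scores = {}
    for cand in top_k:
        # Refine: with cand appended, the last prefix token becomes the
        # second-to-last token. Multiplying by w is an assumption here.
        boost = w if f_s(context, cand) == target else 1.0
        scores[cand] = probs[cand] * boost
    return max(scores, key=lambda t: scores[t])


# Toy stand-ins for the two models (hypothetical, hand-written).
def toy_f_n(prefix: Sequence[Token]) -> Dict[Token, float]:
    return {"mat": 0.40, "moon": 0.45, "dog": 0.15}


def toy_f_s(context: Sequence[Token], last: Token) -> Token:
    return "the" if last == "mat" else "a"


prefix = ["the", "cat", "sat", "on", "the"]
best = generate_then_refine(prefix, toy_f_n, toy_f_s, k=2, w=1.5)
print(best)
```

In this toy setup, $f_n$ alone would choose "moon", while the refinement step promotes "mat" because only that candidate lets the toy $f_s$ recover the preceding token "the". Setting `w=1.0` disables the refinement and falls back to the plain top-1 choice.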

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Conceptual outline of our "generate-then-refine" approach using second-to-last token prediction, formalized in Algorithm 1
  • Figure 2: Example of how input sequences are processed to obtain sequences for training the second-to-last token prediction model $f_s$