Improving Next Tokens via Second-to-Last Predictions with Generate and Refine

Johannes Schneider

TL;DR

The paper aims to improve next-token predictions by leveraging bidirectional context without full encoder–decoder training. It introduces a decoder-only second-to-last token predictor $f_s$ and couples it with a standard autoregressive predictor $f_n$ through a generate-then-refine algorithm: the top-$k$ candidates of $f_n$ are reweighted with a weight $w$ when $f_s$ correctly identifies the second-to-last token. Empirically, second-to-last token predictions are more than $15\%$ more accurate than next-token predictions, and the refinement yields small but statistically significant gains across GPT-2 variants and multiple datasets. Training efficiency is improved by a structured, deterministic masking strategy with subsequences of length $l=4$. The work demonstrates a practical self-correction mechanism in generation and outlines avenues for improvement by tuning $w$, $k$, and token-position strategies, which may help make language generation more reliable.

Abstract

Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture to predict the second-to-last token of a sequence of tokens. Our approach is more computationally efficient to train than BERT-style models because it uses a structured, deterministic approach to masking tokens. We use our model to improve the next-token predictions of a standard GPT by combining both predictions in a ``generate-then-refine'' approach. We demonstrate on different variants of GPT-2 and different datasets that (not unexpectedly) second-to-last token predictions are much more accurate, i.e., more than 15\% higher accuracy than standard next-token predictions. The ``generate-then-refine'' approach also improves next-token predictions, with gains that are smaller but consistent and significant.
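To make the generate-then-refine idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: the predictor interfaces `f_n` and `f_s`, the multiplicative use of the weight `w`, and the renormalization over the top-`k` candidates are all assumptions, and the toy predictors stand in for trained GPT-2 models.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interfaces (assumptions, not the paper's API):
#   f_n(context)      -> {token: probability} for the next token
#   f_s(prefix, last) -> predicted token for the position just before `last`
NextPredictor = Callable[[Sequence[str]], Dict[str, float]]
SecondToLastPredictor = Callable[[Sequence[str], str], str]


def generate_then_refine(
    tokens: Sequence[str],
    f_n: NextPredictor,
    f_s: SecondToLastPredictor,
    k: int = 3,
    w: float = 2.0,
) -> Dict[str, float]:
    """Reweight the top-k next-token candidates using a second-to-last check."""
    if not tokens:
        raise ValueError("need at least one known token")

    # Generate: take the k most probable next-token candidates from f_n.
    probs = f_n(tokens)
    top_k = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]

    # Once a candidate is appended, the last known token becomes the
    # second-to-last token of the extended sequence.
    prefix, known = list(tokens[:-1]), tokens[-1]

    # Refine: boost a candidate (assumed: multiply by w) when f_s, given the
    # candidate as the last token, recovers the token we already know.
    scores = {}
    for candidate, p in top_k:
        agrees = f_s(prefix, candidate) == known
        scores[candidate] = p * w if agrees else p

    total = sum(scores.values())
    return {token: score / total for token, score in scores.items()}


# Toy stand-ins for trained models (purely illustrative).
def toy_f_n(context: Sequence[str]) -> Dict[str, float]:
    return {"mat": 0.4, "moon": 0.35, "dog": 0.25}


def toy_f_s(prefix: Sequence[str], last: str) -> str:
    return "the" if last == "moon" else "a"


refined = generate_then_refine(["over", "the"], toy_f_n, toy_f_s, k=3, w=2.0)
best = max(refined, key=refined.get)
```

In this toy run, `f_n` alone would pick "mat", but only the candidate "moon" lets `f_s` recover the known token "the", so its score is boosted and it becomes the top choice after refinement. Suitable values for `w` and `k` would have to be tuned, as the TL;DR notes.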
