Language models are better than humans at next-token prediction

Buck Shlegeris; Fabien Roger; Lawrence Chan; Euan McLean

TL;DR

The paper directly compares humans and language models on next-token prediction using two metrics: top-1 accuracy and perplexity, on the OpenWebText corpus. Across experiments, humans underperform even small language models (e.g., GPT-Neo-125M, GPT-2 variants), challenging assumptions about human superiority in language tasks. The authors implement careful methodological controls, including importance sampling and bias corrections, to enable apples-to-apples perplexity comparisons and discuss limitations like tokenization and calibration. Overall, the work shows language models achieve superhuman performance at next-token prediction, with implications for interpretability and alignment in AI systems.

Abstract

Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given the previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.
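
As a rough illustration of the two metrics named in the abstract, the sketch below computes top-1 accuracy and perplexity from their standard textbook definitions. The function names and the toy data are hypothetical, chosen only for this example; this is not the paper's estimator, which (per the outline) derives human perplexity from relative probabilities via importance sampling.

```python
import math


def top1_accuracy(predicted_tokens, true_tokens):
    """Fraction of positions where the single best guess equals the true next token."""
    correct = sum(p == t for p, t in zip(predicted_tokens, true_tokens))
    return correct / len(true_tokens)


def perplexity(true_token_probs):
    """exp of the mean negative log-probability assigned to the true next tokens."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


# Toy data (made up for illustration): one guess per position, plus the
# probability the predictor assigned to the token that actually came next.
guesses = ["the", "cat", "sat", "on"]
truth = ["the", "dog", "sat", "in"]
probs_on_truth = [0.5, 0.25, 0.5, 0.25]

accuracy = top1_accuracy(guesses, truth)
ppl = perplexity(probs_on_truth)
print(f"top-1 accuracy = {accuracy:.2f}, perplexity = {ppl:.3f}")
```

Top-1 accuracy only needs a single guess per position, which is why it is easy to collect from people. Perplexity needs a probability for the true token, which is harder to elicit from humans directly and is presumably why the paper resorts to a more involved estimation procedure.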

Paper Structure (19 sections, 9 equations, 8 figures, 1 table)

Introduction
Related work
Measuring human top-1 accuracy
Method
Results
Additional analysis
Measuring human perplexity
Method
Estimating perplexity from a few relative probabilities with importance sampling
Controlling for sample bias in importance sampling
Estimating uncertainty in our measurement of perplexity
Estimating perplexity for language models
Results and limitations
Discussion
Conclusion
...and 4 more sections

Figures (8)

Figure 1: The distribution of human top-1 accuracy (how often a human guesses the correct next token given previous tokens) found in our study, with GPT-Neo-125M, GPT-Neo-1.3B, GPT-J (6B) and GPT-3 (175B) for comparison.
Figure 2: The distribution of human top-1 accuracy on a filtered dataset made out of single-word tokens, with GPT-Neo and GPT-J models of varying sizes for comparison.
Figure 4: Our estimated perplexities for a number of language models and humans. The human perplexity was obtained from our study described in the Method section of "Measuring human perplexity". GPT-2 small has no estimated value, since this was used as the reference generator model. Error bars are determined according to uncertainty source (i) described in the section "Estimating uncertainty in our measurement of perplexity".
Figure 5: Interface for the experiment described in the section "Measuring human top-1 accuracy". The interface is available at https://rr-lm-game.herokuapp.com
Figure 6: Interface for the experiment described in the section "Measuring human perplexity". The interface is available at https://rr-lm-game.herokuapp.com/whichonescored.
...and 3 more figures
