Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi

TL;DR

This work revisits classical n-gram language modeling by scaling to 5 trillion tokens and extending n to infinity, enabling an ∞-gram LM backed by an efficient suffix-array–based infini-gram engine. The system delivers millisecond-scale queries on on-disk indexes and shows ∞-gram provides strong next-token predictions (47% on human-written text) and can substantially reduce neural LMs' perplexity when interpolated. Analyses reveal ∞-gram complements neural models and exposes dynamics in machine-generated text related to decoding and model size, highlighting areas where neural pretraining and positional embeddings may underperform. The authors release a public web interface, API, and Python tools to foster scalable, data-driven analysis of large text corpora and to support data curation, retrieval-augmented reasoning, and contamination detection in pretraining data.

Abstract

Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.
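The two ideas in the abstract, counting arbitrary-length n-grams with a suffix array and backing off to the longest suffix of the prompt that appears in the corpus, can be illustrated on a toy corpus. The following is a minimal sketch, not the infini-gram engine itself: the function names (`build_suffix_array`, `find_range`, `infgram_next_token_distribution`) are hypothetical, the suffix array is built naively in memory rather than on disk, and the choice to keep backing off when a suffix has no continuation token is an assumption made to keep the example well defined.

```python
from collections import Counter


def build_suffix_array(tokens):
    # Toy construction: sort all suffix start positions lexicographically.
    # Fine for a handful of tokens; real indexes need a scalable algorithm.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, pattern):
    """Return [lo, hi) of suffix-array slots whose suffix starts with pattern."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # upper bound: first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def infgram_next_token_distribution(tokens, sa, prompt):
    """Back off to the longest prompt suffix found in the corpus.

    Returns (length of the matched suffix, next-token distribution).
    """
    for start in range(len(prompt) + 1):
        suffix = prompt[start:]
        lo, hi = find_range(tokens, sa, suffix)
        # Count the token that follows each occurrence of the suffix.
        counts = Counter(
            tokens[pos + len(suffix)]
            for pos in sa[lo:hi]
            if pos + len(suffix) < len(tokens)
        )
        if counts:  # assumption: skip suffixes that have no continuation
            total = sum(counts.values())
            return len(suffix), {tok: c / total for tok, c in counts.items()}
    raise ValueError("corpus is empty")


corpus = "the cat sat on the mat the cat ate".split()
sa = build_suffix_array(corpus)
prompt = "a dog saw the cat".split()
matched_len, dist = infgram_next_token_distribution(corpus, sa, prompt)
print(matched_len, dist)
```

In this toy corpus the longest suffix of the prompt that occurs is "the cat", so the estimate is based on the two tokens that follow it. Each count is a pair of binary searches over the suffix array, which is why no n-gram count table has to be precomputed.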

Paper Structure (74 sections, 4 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: An example where a 5-gram LM gives an incorrect prediction but the $\infty$-gram gives the correct prediction by using the longest suffix of the prompt that has a non-zero count in the corpus. The counting and distribution estimate in $\infty$-gram LM are powered by our infini-gram engine.
  • Figure 2: Left: the suffix array for a toy string. Right: illustration of the suffix array in the infini-gram index, with $N = 4$ tokens in the dataset.
  • Figure 3: Token-wise agreement between human-written text and $n$-gram/$\infty$-gram LMs.
  • Figure 4: Distribution of probabilities assigned by neural LMs to human-written text tokens, and $\infty$-gram's agreement with these tokens. Takeaway: $\infty$-gram and neural LMs are predictive of actual human text on different tokens, and thus $\infty$-gram estimates -- especially sparse $\infty$-gram estimates -- can be used to complement neural LMs. See \ref{fig:analysis_human_neural_ngram_more} for extended results on Llama-2 13B/7B models.
  • Figure 5: Token-wise agreement between machine-generated text and $\infty$-gram. All tokens are considered. See \ref{fig:analysis_machine_ngram_more} for results on GPT-Neo models.
  • ...and 11 more figures
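
The takeaway of Figure 4, that $\infty$-gram estimates can complement a neural LM, can be made concrete with a small sketch of interpolating two next-token distributions. This page says only that the models are "interpolated", so the linear form and the weight `lam` below are assumptions, the function names are hypothetical, and the probabilities are made-up toy values rather than results from the paper.

```python
import math


def interpolate(p_neural, p_infgram, lam):
    """Linear mix: lam * p_infgram + (1 - lam) * p_neural (assumed form)."""
    vocab = set(p_neural) | set(p_infgram)
    return {
        tok: lam * p_infgram.get(tok, 0.0) + (1.0 - lam) * p_neural.get(tok, 0.0)
        for tok in vocab
    }


def perplexity(true_token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(nll)


# Toy distributions for one position; the numbers are illustrative only.
p_neural = {"sat": 0.5, "ran": 0.5}
p_infgram = {"sat": 1.0}  # a sparse estimate: a single observed continuation
mixed = interpolate(p_neural, p_infgram, lam=0.5)
print(mixed["sat"], mixed["ran"])
```

Because both inputs sum to one, the mixture does too for any `lam` between 0 and 1, and every token the neural LM supports keeps non-zero probability as long as `lam` is below 1.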