Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
TL;DR
This work revisits classical n-gram language modeling by scaling the training data to 5 trillion tokens and allowing n to be arbitrarily large, yielding an ∞-gram LM served by infini-gram, an efficient suffix-array–based engine. The engine answers queries against on-disk indexes with millisecond-level latency. The ∞-gram LM reaches 47% next-token prediction accuracy on human-written text and substantially reduces neural LMs' perplexity when interpolated with them. Analyses show that the ∞-gram complements neural models and exposes irregularities in machine-generated text that depend on the decoding method and model size, pointing to areas where neural pretraining and positional embeddings may underperform. The authors release a public web interface, API, and Python tools to foster scalable, data-driven analysis of large text corpora and to support data curation, retrieval-augmented reasoning, and contamination detection in pretraining data.
Abstract
Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their value in both text analysis and improving neural LLMs. We do so by modernizing $n$-gram LMs in two respects. First, we train them at the same data scale as neural LLMs: 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicate deficiencies in neural LLM pretraining and the positional embeddings of Transformers.
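
To make the suffix-array idea concrete, the following is a minimal in-memory sketch in Python, not the infini-gram engine itself. It builds a suffix array over a toy token list, counts any $n$-gram by binary search instead of a pre-computed count table, and estimates a next-token distribution by backing off to the longest context suffix that occurs in the corpus. The function names (`build_suffix_array`, `find_range`, `infgram_next_token_dist`), the naive construction, and the choice to normalize only over occurrences that have a following token are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def build_suffix_array(tokens):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for illustration only; it would not scale to
    trillions of tokens.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, query):
    """Return [lo, hi) of suffix-array slots whose suffix starts with `query`."""
    m = len(query)

    # Lower bound: first slot whose m-token prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first slot whose m-token prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(tokens, sa, query):
    """Number of occurrences of the n-gram `query` in the corpus."""
    lo, hi = find_range(tokens, sa, query)
    return hi - lo


def infgram_next_token_dist(tokens, sa, context):
    """Back off to the longest suffix of `context` seen in the corpus.

    Returns (effective_n_minus_1, {token: probability}). Assumption of this
    sketch: occurrences at the very end of the corpus, which have no
    following token, are ignored when normalizing.
    """
    for start in range(len(context) + 1):
        suffix = context[start:]
        lo, hi = find_range(tokens, sa, suffix)
        nexts = Counter(
            tokens[sa[k] + len(suffix)]
            for k in range(lo, hi)
            if sa[k] + len(suffix) < len(tokens)
        )
        if nexts:
            total = sum(nexts.values())
            return len(suffix), {w: c / total for w, c in nexts.items()}
    return 0, {}


corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
print(count(corpus, sa, ["the", "cat"]))
print(infgram_next_token_dist(corpus, sa, ["on", "the", "cat"]))
```

In the demo, the context "on the cat" never appears in the toy corpus, so the sketch shortens it to "the cat" and estimates the next token from the continuations of that shorter match. A production system would keep the suffix array on disk and use a far more efficient construction, as the abstract indicates.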
