Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
TL;DR
Infini-gram mini is an FM-index-based system for exact-match search over petabyte-scale text. Its index is compact (about 0.44× the size of the corpus), and queries run against the on-disk index, so RAM usage stays minimal. The approach parallelizes construction of the suffix array (SA), Burrows-Wheeler transform (BWT), and inverse suffix array (ISA), and uses Huffman-shaped wavelet trees and document-boundary encoding to support fast counting and retrieval across shards. The authors index 83 TB of Internet text in 99 days on a single 128-vCPU node (or 19 hours across 137 such nodes) and provide a web interface and API for practical search tasks. A major application is large-scale benchmark contamination analysis, which reveals substantial contamination across widely used LM evaluation benchmarks and motivates the Benchmark Contamination Bulletin for ongoing monitoring. In exchange for its storage efficiency and scalability, the system has higher document-retrieval latency than canonical suffix-array-based approaches and is limited to exact-match queries, which leaves near-match and co-occurrence queries as avenues for future extension.
Abstract
Language models are trained mainly on massive text data from the Internet, and understanding this data source is becoming increasingly important. Exact-match search engines enable searching large text corpora (counting string appearances and retrieving the enclosing documents), yet their high storage overhead hinders their application to Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes only 44% the size of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days on a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.
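
To make the counting operation concrete, the following is a minimal in-memory sketch of FM-index backward search, the textbook technique the FM-index is built on. It is not the infini-gram mini implementation: the function names `build_fm_index` and `count` are hypothetical, the suffix array is built by naive sorting, and a dense occurrence table stands in for the on-disk, Huffman-shaped wavelet trees the system uses. Sharding and document-boundary encoding are omitted.

```python
from collections import Counter


def build_fm_index(text: str):
    """Build a toy FM-index (C table and Occ table) for `text`.

    Illustrative only: the naive suffix sort and the dense Occ table
    would not scale to a real corpus.
    """
    assert "\0" not in text, "the sentinel character must not occur in the text"
    s = text + "\0"  # sentinel, lexicographically smaller than any other character

    # Suffix array: starting positions of all suffixes in sorted order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    # BWT: the character preceding each sorted suffix (wraps to the sentinel).
    bwt = "".join(s[i - 1] for i in sa)

    # C[c] = number of characters in s that are strictly smaller than c.
    counts = Counter(s)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)

    return c_table, occ, len(bwt)


def count(index, pattern: str) -> int:
    """Count exact occurrences of `pattern` using backward search."""
    c_table, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows matching the suffix seen so far
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))  # occurrences of "abra" in "abracadabra"
```

Each pattern character costs two occurrence lookups, so the number of lookups grows with the pattern length rather than the corpus size. Note that counting needs only the C table and the occurrence structure, not the original text or the full suffix array.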
