Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index

Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi

TL;DR

Infini-gram mini is an FM-index–based system for exact-match search on petabyte-scale text, with a compact index (about 0.44× the corpus size) and on-disk querying that keeps RAM usage minimal. It parallelizes construction of the suffix array (SA), Burrows-Wheeler transform (BWT), and inverse suffix array (ISA), and uses Huffman-shaped wavelet trees and document-boundary encoding to support fast counting and retrieval across shards. The authors index 83 TB of text in 99 days on a single 128-vCPU node (or 19 hours across 137 such nodes) and provide a web interface and API for practical search tasks. A major application is large-scale benchmark contamination analysis, which reveals substantial contamination across widely used LM evaluation benchmarks and motivates the Benchmark Contamination Bulletin for ongoing monitoring. In exchange for its storage efficiency and scalability, the system has higher document-retrieval latency than canonical suffix-array-based approaches and supports only exact-match queries, which leaves near-match and co-occurrence queries as future extensions.

Abstract

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora - counting string appearances and retrieving the enclosing documents - yet the high storage overhead hinders their application on Internet-scale data. We present infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such nodes). We show one important use case of infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on infini-gram mini indexes.

Paper Structure

This paper contains 50 sections, 1 equation, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Overview of infini-gram mini. Based on the FM-index data structure, infini-gram mini supports efficient exact-match search in massive text corpora ($n \simeq 10^{15}$ bytes) while reducing the index size to 7% of that of a canonical suffix array index. Searching naively in the corpus would have time complexity of $O(n)$ and is thus impractical; with infini-gram mini, the search time complexity is independent of $n$. $|Q|$ is the length of the query string and can be arbitrarily long, and $H_0 \approx 2.1$ is the zeroth-order entropy of the text corpus.
  • Figure 2: The FM-index data structure (§\ref{sec:data_structure}) used in infini-gram mini, shown for a toy string with length $n = 7$. The suffix array is sampled with a sampling rate $a=3$ and only elements corresponding to bolded suffixes are stored. The BWT can be derived from the SA, and is stored in compressed form as a Huffman-shaped wavelet tree.
  • Figure 3: Examples of four contamination types. Violet text is the text overlap between benchmark entry and corpus. Magenta text is the mapping of answers.
  • Figure 4: Illustration of operations on FM-index (§\ref{sec:operations}, App. §\ref{app:fm-query}). Left: find operation computes the SA range corresponding to all occurrences of the pattern. Middle: locate operation computes the position of pattern occurrence in the original string for each position in the SA range. Right: reconstruct operation gets a substring of the original string enclosing the second pattern occurrence with a context length of 1. The occurrence ranking is based on its order in SA.
  • Figure 5: The web interface of infini-gram mini. Left: counting a string. Right: retrieving documents.
  • ...and 8 more figures
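
The find and locate operations described in the captions of Figures 2 and 4 can be illustrated with a minimal Python sketch. This is not the authors' implementation: it builds the suffix array naively, stores the BWT as a plain string with prefix-count tables instead of a Huffman-shaped wavelet tree, and samples the SA by text position (an assumption; the paper's sampling scheme may differ). The toy string "banana", the sentinel character, and the function names `build_fm_index`, `find`, and `locate` are all hypothetical choices made for this example.

```python
def build_fm_index(text, sample_rate=3):
    """Build a toy FM-index (naive construction, for illustration only)."""
    s = text + "\0"  # sentinel, assumed smaller than every other character
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # full suffix array
    bwt = "".join(s[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(s))
    # C[c]: number of characters in s that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    # Keep only SA entries whose text position is a multiple of sample_rate
    sampled = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return {"bwt": bwt, "C": C, "occ": occ, "sampled": sampled, "n": n}


def find(index, pattern):
    """Backward search: return the half-open SA range [lo, hi) of the pattern."""
    lo, hi = 0, index["n"]
    for c in reversed(pattern):
        if c not in index["C"]:
            return 0, 0
        lo = index["C"][c] + index["occ"][c][lo]
        hi = index["C"][c] + index["occ"][c][hi]
        if lo >= hi:
            return 0, 0  # pattern does not occur
    return lo, hi


def locate(index, row):
    """Recover the text position for one SA row using the sampled SA."""
    steps = 0
    while row not in index["sampled"]:
        c = index["bwt"][row]
        row = index["C"][c] + index["occ"][c][row]  # LF-mapping: one step left
        steps += 1
    return index["sampled"][row] + steps


index = build_fm_index("banana")
lo, hi = find(index, "ana")
count = hi - lo
positions = sorted(locate(index, row) for row in range(lo, hi))
print(count, positions)
```

The loop in `find` runs once per pattern character and never touches the text itself, which is why the search cost depends on $|Q|$ rather than on $n$. In `locate`, a larger `sample_rate` stores fewer SA entries but requires more LF-mapping steps per occurrence, which mirrors the storage versus retrieval-latency trade-off mentioned in the TL;DR.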