Koala: An Index for Quantifying Overlaps with Pre-training Corpora
Thuy-Trang Vu, Xuanli He, Gholamreza Haffari, Ehsan Shareghi
TL;DR
Koala presents a scalable, public index over large pre-training corpora using compressed suffix arrays (FM-Index) to quantify overlaps with training data. It implements a full pre-processing pipeline (cleaning, deduplication with MinHashLSH, and Moses tokenization) and builds per-corpus indexes covering multiple public sources to enable exact-match n-gram overlap analysis. By defining metrics such as $M^{k,t}_x/N^k_x$ and $M^{l,t}_x/N^l_x$, it enables cross-benchmark investigations into data leakage and memorization, with insights illustrated on OpenBookQA and PIQA. The Koala web interface supports uploading custom n-gram files, visualizing overlap statistics, and verifying the novelty of generated outputs, offering a practical tool for benchmark design and safety research in LLMs. Future work aims to broaden corpus coverage and analytical capabilities to deepen understanding of pre-training data effects.
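As a rough illustration of the exact-match overlap ratio mentioned above, the sketch below assumes that $N^k_x$ counts the k-grams in a text $x$ and that $M^{k,t}_x$ counts those k-grams occurring at least $t$ times in the corpus; this reading is an assumption, not a definition taken from the summary. It also swaps in a naive, uncompressed suffix array and whitespace tokenization in place of Koala's FM-Index and Moses tokenizer, and all function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive suffix array over a token list (toy corpora only)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def count_occurrences(tokens, sa, pattern):
    """Count exact occurrences of `pattern` via two binary searches."""
    pattern = list(pattern)
    k = len(pattern)

    def prefix(rank):
        # First k tokens of the suffix at this rank (may be shorter near the end).
        start = sa[rank]
        return tokens[start:start + k]

    # Lower bound: first suffix whose k-prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Upper bound: first suffix whose k-prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - left


def overlap_ratio(text_tokens, corpus_tokens, sa, k, t):
    """Assumed M^{k,t}_x / N^k_x: share of k-grams in x seen >= t times in the corpus."""
    ngrams = [text_tokens[i:i + k] for i in range(len(text_tokens) - k + 1)]
    if not ngrams:
        return 0.0
    matched = sum(1 for g in ngrams if count_occurrences(corpus_tokens, sa, g) >= t)
    return matched / len(ngrams)


# Toy data; whitespace splitting stands in for Moses tokenization.
corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
text = "the cat sat down".split()
ratio_t1 = overlap_ratio(text, corpus, sa, k=2, t=1)
ratio_t2 = overlap_ratio(text, corpus, sa, k=2, t=2)
```

Raising the threshold `t` restricts the numerator to k-grams that are frequent in the corpus, which is one way to separate incidental overlap from heavily repeated text.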
Abstract
In recent years, increasing attention has been paid to the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, which offer a highly efficient compression rate and search support. In its first release, we index the publicly available portion of the OPT 175B pre-training data. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in LLM outputs. Koala is available for public use at https://koala-index.erc.monash.edu/.
