Koala: An Index for Quantifying Overlaps with Pre-training Corpora

Thuy-Trang Vu; Xuanli He; Gholamreza Haffari; Ehsan Shareghi

TL;DR

Koala presents a scalable, public index over large pre-training corpora using compressed suffix arrays (FM-Index) to quantify overlaps with training data. It implements a full pre-processing pipeline (cleaning, deduplication with MinHashLSH, and Moses tokenization) and builds per-corpus indexes covering multiple public sources to enable exact-match n-gram overlap analysis. By defining metrics such as $M^{k,t}_x/N^k_x$ and $M^{l,t}_x/N^l_x$, it enables cross-benchmark investigations into data leakage and memorization, with insights illustrated on OpenBookQA and PIQA. The Koala web interface supports uploading custom n-gram files, visualizing overlap statistics, and verifying the novelty of generated outputs, offering a practical tool for benchmark design and safety research in LLMs. Future work aims to broaden corpus coverage and analytical capabilities to deepen understanding of pre-training data effects.

Abstract

In recent years, more attention has been paid to probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, which offer a high compression rate and efficient search. In its first release we index the publicly available portion of the OPT 175B pre-training data. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in the output of LLMs. Koala is available for public use at https://koala-index.erc.monash.edu/.
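
To make the idea of exact-match search over a suffix-array-style index more concrete, the following is a minimal, illustrative sketch of FM-Index backward search for counting pattern occurrences. It is not Koala's implementation: it works on characters rather than tokens, builds the suffix array naively, and stores an uncompressed occurrence table, so it shows only the search logic and not the compression. The names `build_fm_index` and `count` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-Index over a string (illustration only)."""
    text = text + "\0"  # sentinel, assumed smaller than every other character
    # Naive suffix array construction; real indexes use far faster algorithms.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count(index, pattern):
    """Count exact occurrences of a pattern via backward search."""
    C, occ, n = index
    lo, hi = 0, n  # current half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))  # occurrences of "abra" in the indexed text
```

The search cost depends on the pattern length rather than the corpus size, which is the property that makes this family of indexes attractive for querying very large corpora.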

Paper Structure (11 sections, 2 figures, 2 tables)

This paper contains 11 sections, 2 figures, 2 tables.

Introduction
Pre-processing and Corpora Coverage
Pre-processing Steps
Corpora Coverage
Pipeline and Features of Koala
Data Structure of Koala
$n$-gram Overlap Statistics of Koala
Highlights from Figure 1 (Left Panel):
Highlights from Figure 1 (Right Panel):
Interface of Koala
Conclusion and Future Work

Figures (2)

Figure 1: Visualisations of $n$-gram overlap statistics for the answer side of the OpenBookQA and PIQA test sets. Top: OpenBookQA Answer Set; Bottom: PIQA Answer Set. Left: average per-instance $k$-gram hit ratio (a hit ratio of 1 means 100% of the $k$-grams in an instance were hits); Right: average per-instance $k$-gram hit length ratio with respect to the instance length (1 means the $k$-gram was fully covered, 0.75 means it was 3/4 covered, etc.). The PIQA test set size is 1838; the OpenBookQA test set size is 500.
Figure 2: Screenshots from different features of the Koala webpage. For the latest version of the interface, please refer to the website.
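
The per-instance hit ratio described in the Figure 1 caption can be sketched as follows. This is a simplified illustration, not the paper's code: a Python set of corpus $k$-grams stands in for lookups against the index, tokenization is plain whitespace splitting, and the function names (`kgrams`, `hit_ratio`, `average_hit_ratio`) are hypothetical. Instances shorter than $k$ tokens are skipped when averaging, which is an assumption of this sketch.

```python
def kgrams(tokens, k):
    """Return all contiguous k-grams of a token list as tuples."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def hit_ratio(tokens, k, corpus_kgrams):
    """Fraction of an instance's k-grams that occur in the corpus.

    Returns None when the instance has fewer than k tokens.
    """
    grams = kgrams(tokens, k)
    if not grams:
        return None
    hits = sum(1 for g in grams if g in corpus_kgrams)
    return hits / len(grams)


def average_hit_ratio(instances, k, corpus_kgrams):
    """Average the per-instance hit ratio over a test set."""
    ratios = [hit_ratio(t, k, corpus_kgrams) for t in instances]
    ratios = [r for r in ratios if r is not None]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Toy corpus and test instances (made-up data for illustration).
corpus = "the cat sat on the mat".split()
corpus_bigrams = set(kgrams(corpus, 2))
instances = ["the cat ran".split(), "sat on the mat".split()]
print(average_hit_ratio(instances, 2, corpus_bigrams))
```

A ratio of 1 for an instance means every one of its $k$-grams was found in the corpus, matching the reading given in the caption.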
