Getting Your Indices in a Row: Full-Text Search for LLM Training Data for Real World
Ines Altemir Marinas, Anastasiia Kucherenko, Alexander Sternfeld, Andrei Kucharavy
TL;DR
This paper demonstrates that scalable full-text indexing of large LLM training datasets on energy-efficient arm64 HPC infrastructure is feasible, by porting Elasticsearch to the Alps platform and indexing 8.6T of the 15.2T tokens in the Apertus training corpus. The approach enables offline, open-web-like search and supports safety analyses by providing granular, queryable access to multilingual training data, including harmful-content subsets. Key contributions include a complete arm64 deployment, performance benchmarks for indexing and querying at scale, and practical safety use cases (weaponized language and chemical terms) that inform data curation. The work emphasizes green computing benefits and argues for broader, data-centric safety practices in LLM development, while highlighting ethical and governance questions surrounding open training data indexes. The resulting infrastructure paves the way for transparent auditing and safer deployment of open-weight LLMs through scalable, open-access data analysis.
Abstract
The performance of Large Language Models (LLMs) is determined by their training data. Despite the proliferation of open-weight LLMs, access to LLM training data has remained limited. Even for fully open LLMs, the scale of the data makes it all but inscrutable to the general scientific community, even though it may contain critical data scraped from the internet. In this paper, we present the full-text indexing pipeline for the Apertus LLM training data. Leveraging parallel Elasticsearch indices and the Alps infrastructure, a state-of-the-art, highly energy-efficient arm64 supercluster, we were able to index 8.6T of the 15.2T tokens used to train the Apertus LLM family, creating both a critical LLM safety tool and, effectively, an offline, curated, open web search engine. Our contribution is threefold. First, we demonstrate that Elasticsearch can be successfully ported onto next-generation arm64-based infrastructure. Second, we demonstrate that full-text indexing at the scale of modern LLM training datasets and the entire open web is feasible and accessible. Finally, we demonstrate that such indices can enable a form of jailbreak-agnostic LLM safety that was previously inaccessible. We hope that our findings will be useful to other teams attempting large-scale data indexing and will facilitate the general transition towards greener computation.
