SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Masataka Yoneda; Yusuke Matsushita; Go Kamoda; Kohei Suenaga; Takuya Akiba; Masaki Waga; Sho Yokoi

TL;DR

This work presents an ultra-fast and flexible search algorithm that searches trillion-scale natural language corpora in under 0.3 seconds. It handles semantic variations (substitution, insertion, and deletion) and leverages statistical properties of natural language to keep the search space small.

Abstract

We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
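As a rough illustration of the suffix-array string matching the abstract refers to, the following Python sketch builds a suffix array over a tiny token list and counts exact matches by binary search. It is a toy, in-memory version under our own assumptions: the function names are hypothetical, the construction is naive, and nothing here reproduces the paper's disk-aware design.

```python
def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a toy corpus only; real indexes use far more scalable methods.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    # Binary search over suffixes truncated to len(pattern).
    # upper=False: first position whose prefix is >= pattern.
    # upper=True:  first position whose prefix is >  pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, sa, pattern):
    # All suffixes starting with `pattern` form one contiguous block in `sa`.
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


corpus = "the gold medal and the silver medal and the gold cup".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["the", "gold"]))
```

Each lookup costs a logarithmic number of comparisons in the corpus size, which is why this family of indexes is attractive for very large corpora.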

Paper Structure (38 sections, 3 theorems, 9 equations, 15 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Corpus Search for NLP and Language Models
String Matching Algorithms
Text Search for Massive Corpora
SoftMatcha 2: Our Soft and Fast Corpus Search Algorithm
Problem Setting
Fast Disk-Aware Suffix Array
Dynamic Corpus-Aware Pruning of Search Space
Theoretical Analysis
Empirical Evaluation
Qualitative and Linguistic Evaluation
Disk Access in Exact Lookup
Latency for Soft Search
Disk Usage and Index Construction
...and 23 more sections

Key Result

Lemma 1

$L_i = \{ w \in S_i \mid w \ \text{occurs in}\ \mathcal{C} \}$ for each $i$.
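The statement above reads as a set-builder filter: $L_i$ keeps exactly those members of $S_i$ that occur in the corpus $\mathcal{C}$. This listing does not define $S_i$, so the Python sketch below assumes, purely for illustration, that $S_i$ is the set of length-$i$ candidate patterns obtained by extending the survivors of step $i-1$ with the allowed words at position $i$. The function names and the linear-scan `occurs` check are hypothetical stand-ins, not the paper's implementation.

```python
def occurs(corpus, pattern):
    # Stand-in for an exact index lookup: linear scan over the token list.
    m = len(pattern)
    return any(corpus[i:i + m] == pattern for i in range(len(corpus) - m + 1))


def pruned_candidates(corpus, alternatives):
    # alternatives[i] lists the words allowed at query position i (assumed input).
    survivors = [[]]  # L_0: only the empty pattern
    lookups = 0
    for options in alternatives:
        # S_i under our assumption: every survivor extended by every option.
        candidates = [p + [w] for p in survivors for w in options]
        lookups += len(candidates)
        # L_i = { w in S_i | w occurs in C }
        survivors = [c for c in candidates if occurs(corpus, c)]
    return survivors, lookups


corpus = "she won an olympic gold medal and a world silver medal".split()
alternatives = [
    ["olympic", "olympics", "world", "national"],
    ["gold", "silver", "bronze"],
    ["medal", "medals", "award"],
]
matches, lookups = pruned_candidates(corpus, alternatives)
print(matches, lookups)
```

In this toy run the filter issues 16 lookups, whereas enumerating every full-length combination would require 4 × 3 × 3 = 36; the gap comes from dropping prefixes that never occur in the corpus before they are extended.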

Figures (15)

Figure 1: An example of search in SoftMatcha 2, which performs soft search over trillion-scale corpora within 0.3 seconds, allowing word substitution, insertion, and deletion.
Figure 2: A sketch of our soft search algorithm for the query "olympics gold medal". Without iterative pruning, we must search the gray-striped zone as well as the blue zone.
Figure 3: The p95 (95th-percentile) latency of exact search on the FineWeb-Edu dataset (1.4T tokens). infini-gram mini timed out during index construction on larger corpora and raised an error on smaller corpora.
Figure 4: The p95 (95th-percentile) latency of soft search for EN (FineWeb-Edu, 1.4T tokens), JA (C4 Japanese, 169B tokens), and ZH (C4 Chinese, 38.3B tokens). SoftMatcha hit the memory limit, timed out during index construction, or raised errors on larger corpora.
Figure 5: The number of exact string matching lookups with and without enabling the pruning techniques over the FineWeb-Edu dataset (1.4T-token dataset and 436M-token subsampled dataset). No data is displayed if a timeout (10 sec.) occurred.
...and 10 more figures

Theorems & Definitions (9)

Lemma 1
proof
proof : Justification
Lemma 2: Evaluate $\mathit{Total}$, in a general setting
proof
proof : Proof of \ref{itheorem:no-explode-over-m}
Lemma 3
proof
proof : Proof of \ref{itheorem:sublinear-corpus}
