SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi

TL;DR

This work presents an ultra-fast and flexible search algorithm that searches trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). It leverages statistical properties of natural language to suppress the growth of the search space.

Abstract

We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
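
As a rough illustration of the suffix-array string matching the abstract refers to, the following toy sketch counts exact occurrences of a token pattern by binary search over sorted suffixes. It is not the authors' implementation: the function names are hypothetical, the index is built naively in memory, and nothing here reflects the disk-aware design described in the paper.

```python
from typing import List, Sequence


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return suffix start positions sorted lexicographically (naive, toy-sized input only)."""
    return sorted(range(len(tokens)), key=lambda i: tuple(tokens[i:]))


def count_occurrences(tokens: Sequence[str], sa: List[int], pattern: Sequence[str]) -> int:
    """Count exact occurrences of `pattern` using two binary searches on the suffix array."""
    m = len(pattern)
    pat = tuple(pattern)

    def prefix(rank: int) -> tuple:
        # First m tokens of the suffix at this rank; truncation preserves sorted order.
        start = sa[rank]
        return tuple(tokens[start:start + m])

    # Lower bound: first rank whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


corpus = "the gold medal at the olympics gold medal".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["gold", "medal"]))
```

Each lookup costs a logarithmic number of suffix comparisons in the corpus length, which is the general reason suffix-array matching scales well with corpus size.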

Paper Structure

This paper contains 38 sections, 3 theorems, 9 equations, 15 figures, 18 tables, and 1 algorithm.

Key Result

Lemma 1

$L_i = \{ w \in S_i \mid w \ \text{occurs in}\ \mathcal{C} \}$ for each $i$.
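
One possible reading of this statement, offered as an assumption because the page does not define the symbols, is that $S_i$ is the set of candidate patterns considered at step $i$, $\mathcal{C}$ is the corpus, and $L_i$ is the subset of candidates that actually occur. The sketch below illustrates that reading with substitution only: candidates are built by extending the survivors of the previous step, and those that never occur in the corpus are dropped before the next extension. The `alternatives` table, the function names, and the naive `occurs` scan are all hypothetical stand-ins, not the paper's method.

```python
from typing import Dict, List, Sequence, Set, Tuple

Pattern = Tuple[str, ...]


def occurs(corpus: Sequence[str], pattern: Pattern) -> bool:
    """Naive occurrence check (stand-in for a fast exact lookup)."""
    m = len(pattern)
    return any(tuple(corpus[i:i + m]) == pattern for i in range(len(corpus) - m + 1))


def soft_search(
    corpus: Sequence[str],
    query: Sequence[str],
    alternatives: Dict[str, List[str]],
) -> Tuple[Set[Pattern], int]:
    """Return surviving full-length patterns and the number of exact lookups used."""
    survivors: Set[Pattern] = {()}  # assumed L_0: only the empty pattern
    lookups = 0
    for word in query:
        options = [word] + alternatives.get(word, [])
        # Assumed S_i: every survivor extended by every allowed substitute.
        candidates = {prefix + (opt,) for prefix in survivors for opt in options}
        survivors = set()
        for cand in candidates:
            lookups += 1
            if occurs(corpus, cand):  # assumed L_i: candidates that occur in the corpus
                survivors.add(cand)
    return survivors, lookups


corpus = "she won an olympic gold medal and a silver medal".split()
alternatives = {"olympics": ["olympic"], "gold": ["silver", "bronze"], "medal": ["medals"]}
matches, lookups = soft_search(corpus, ["olympics", "gold", "medal"], alternatives)
print(matches, lookups)
```

In this toy run the pruned search issues 7 lookups, whereas enumerating every combination of substitutes up front would require 2 × 3 × 2 = 12 full-length patterns.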

Figures (15)

  • Figure 1: An example of search in SoftMatcha 2, which performs a soft search over trillion-scale corpora within 0.3 seconds, covering word substitution, insertion, and deletion.
  • Figure 2: A sketch of our soft search algorithm for the query "olympics gold medal". Without iterative pruning, we must search the gray-striped zone as well as the blue zone.
  • Figure 3: The p95 (95th-percentile) latency of exact search on the FineWeb-Edu dataset (1.4T tokens). infini-gram mini timed out during index construction on larger corpora and raised an error on smaller corpora.
  • Figure 4: The p95 (95th-percentile) latency of soft search for EN (FineWeb-Edu, 1.4T tokens), JA (C4 Japanese, 169B tokens), and ZH (C4 Chinese, 38.3B tokens). On larger corpora, SoftMatcha hit the memory limit, timed out during index construction, or raised errors.
  • Figure 5: The number of exact string matching lookups with and without the pruning techniques on the FineWeb-Edu dataset (the 1.4T-token dataset and a 436M-token subsample). No data is shown where a timeout (10 sec.) occurred.
  • ...and 10 more figures

Theorems & Definitions (9)

  • Lemma 1
  • proof
  • proof : Justification
  • Lemma 2: Evaluate $\mathit{Total}$, in a general setting
  • proof
  • proof : Proof of the theorem labeled itheorem:no-explode-over-m
  • Lemma 3
  • proof
  • proof : Proof of the theorem labeled itheorem:sublinear-corpus