Engineering faster double-array Aho-Corasick automata

Shunsuke Kanda, Koichi Akabe, Yusuke Oda

TL;DR

This work targets efficient multiple pattern matching via double-array Aho-Corasick automata (DAACs). It surveys and categorizes a wide range of implementation techniques, proposes new optimizations, and conducts exhaustive experiments on real-world datasets to identify the best technique combinations. The authors implement Daachorse in Rust, demonstrate superior speed and memory performance over existing AC implementations, and show practical impact by integrating it with Vaporetto, a Japanese tokenizer, achieving substantial speedups (e.g., up to 2.6x). The study provides actionable guidance for engineers and contributes an open-source, high-performance DAAC library suitable for fast pattern matching across languages and applications.

Abstract

Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for fast multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem with DAACs is that these techniques have not been consolidated. Since comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC. In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments on real-world datasets and reveal the best combination of techniques to achieve higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github.com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.

Paper Structure

This paper contains 47 sections, 1 theorem, 6 equations, 14 figures, 12 tables, and 4 algorithms.

Key Result

Theorem 1

Let $B = 2^{\lceil \log_2 |\Sigma| \rceil}$. (In the Mapped scheme, $\sigma$ is used instead of $|\Sigma|$.) When state ids are defined using Equation \ref{eq:da_xor}, all destination states from a state are always placed in the same block.
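
The block property can be checked directly: XOR with a label $c < B$ can only change the low $\log_2 B$ bits of an id, so the quotient by $B$ (the block index) is unchanged. Below is a minimal Rust sketch of that check. It assumes that Equation eq:da_xor computes a destination id as `BASE[s] XOR c`, since the equation itself is not reproduced in this summary. All function names are hypothetical and are not taken from Daachorse.

```rust
/// Block size B = 2^ceil(log2(alphabet_size)), i.e. the smallest power of
/// two that is at least the alphabet size.
fn block_size(alphabet_size: u32) -> u32 {
    alphabet_size.next_power_of_two()
}

/// ASSUMPTION: XOR-based addressing, child id = base XOR label.
fn child_id(base: u32, label: u32) -> u32 {
    base ^ label
}

/// Index of the block of size `b` that contains `id`.
fn block_of(id: u32, b: u32) -> u32 {
    id / b
}

/// Returns true if every candidate child id of a state with the given base
/// value falls into the same block as the base value itself.
fn children_share_block(base: u32, alphabet_size: u32) -> bool {
    let b = block_size(alphabet_size);
    (0..alphabet_size).all(|label| block_of(child_id(base, label), b) == block_of(base, b))
}
```

For example, with an alphabet of 4 labels and a base value of 13, the candidate child ids are 13, 12, 15, and 14, all inside block 3 (ids 12 to 15). Additive addressing would instead place 13 + 3 = 16 in block 4.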

Figures (14)

  • Figure 1: Examples of (a) an AC automaton for the dictionary of Table \ref{tab:dictioanry} and (b) its trie part. Transitions are depicted by solid line arrows. $\delta(0,\texttt{b}) = 2$, $\delta(2,\texttt{a}) = 5$, and $\delta(2,\texttt{c}) = -1$. We depict the mappings of the failure function (except ones to the initial state) by dotted line arrows. $f(4)=2$, $f(5)=1$, $f(6)=2$, $f(7)=3$, $f(8)=4$, and $f(s) = 0$ for the other states $s$. Output states are shaded and associated with pattern indices (drawn from $\texttt{A},\texttt{B},\texttt{C},\dots$). $h(2)=\{\texttt{B}\}$, $h(4)=\{\texttt{A,B}\}$, $h(6)=\{\texttt{E,B}\}$, $h(7)=\{\texttt{F}\}$, $h(8)=\{\texttt{C,A,B}\}$, $h(9)=\{\texttt{D}\}$, and $h(s) = \emptyset$ for the other states $s$.
  • Figure 2: BASE and CHECK implementing the transition function $\delta$ of Figure \ref{fig:ac:trie}. $\Sigma = \{ \texttt{a} = 0,\texttt{b} = 1,\texttt{c} = 2,\texttt{d} = 3 \}$. The state ids are assigned to satisfy Equation \ref{eq:da}. $\delta(0,\texttt{b}) = 2$ is simulated by $\textsf{BASE}[0] + \texttt{b} = 2$ and $\textsf{CHECK}[2] = 0$. $\delta(2,\texttt{d}) = -1$ is simulated by $\textsf{BASE}[2] + \texttt{d} = 8$ and $\textsf{CHECK}[8] \neq 2$. The state id 7 is a vacant id because its element does not represent any state of the original trie.
  • Figure 3: DAAC for the AC automaton in Figure \ref{fig:ac:pma}.
  • Figure 4: Examples of approaches to store output sets of Figure \ref{fig:da}.
  • Figure 5: Illustrations of memory layouts of arrays.
  • ...and 9 more figures
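
The BASE/CHECK lookup described in the Figure 2 caption can be sketched in Rust as follows. This is an illustration only: the arrays hold just the entries that can be derived from the caption ($\textsf{BASE}[0] = 1$, $\textsf{BASE}[2] = 5$, $\textsf{CHECK}[2] = 0$), and every other slot is marked with a hypothetical `UNSET` sentinel. The names `UNSET`, `transition`, and `figure2_fragment` are assumptions made for this sketch, not identifiers from the paper or from Daachorse.

```rust
/// Hypothetical sentinel for entries whose values the caption does not give.
const UNSET: u32 = u32::MAX;

/// Simulates delta(s, c) with additive addressing: the candidate destination
/// is t = BASE[s] + c, and it is accepted only if CHECK[t] == s.
/// Returns None where the caption writes delta(s, c) = -1.
fn transition(base: &[u32], check: &[u32], s: usize, c: u32) -> Option<usize> {
    let b = *base.get(s)?;
    if b == UNSET {
        return None;
    }
    let t = b.checked_add(c)? as usize;
    match check.get(t) {
        Some(&parent) if parent != UNSET && parent as usize == s => Some(t),
        _ => None,
    }
}

/// Builds only the part of BASE and CHECK recoverable from the caption,
/// using the label codes a = 0, b = 1, c = 2, d = 3.
fn figure2_fragment() -> (Vec<u32>, Vec<u32>) {
    let mut base = vec![UNSET; 9];
    let mut check = vec![UNSET; 9];
    base[0] = 1; // BASE[0] + b = 2 with b = 1
    base[2] = 5; // BASE[2] + d = 8 with d = 3
    check[2] = 0; // state 2 is reached from state 0
    // CHECK[8] is only known to differ from 2, so it stays UNSET here.
    (base, check)
}
```

With this fragment, `transition(&base, &check, 0, 1)` yields `Some(2)` and `transition(&base, &check, 2, 3)` yields `None`, matching the two cases in the caption. Transitions that depend on entries left `UNSET` also return `None`, which reflects the incomplete table rather than the real automaton.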

Theorems & Definitions (2)

  • Theorem 1
  • proof