Packed Acyclic Deterministic Finite Automata

Hiroki Shibata, Masakazu Ishihata, Shunsuke Inenaga

TL;DR

This paper introduces the packed ADFA (PADFA), a compact variant of the ADFA designed to make pattern searching more efficient by encoding specific paths as packed strings stored in contiguous memory.

Abstract

An acyclic deterministic finite automaton (ADFA) is a data structure that represents a set of strings (i.e., a dictionary) and supports pattern searching, the problem of determining whether a given pattern string is present in the dictionary. We introduce the packed ADFA (PADFA), a compact variant of the ADFA designed to make pattern searching more efficient by encoding specific paths as packed strings stored in contiguous memory. We show theoretically that pattern searching in a PADFA is near time-optimal, with a small additional overhead, and becomes fully time-optimal for sufficiently long patterns. Moreover, we prove that a PADFA requires fewer bits than a trie when the dictionary size is small relative to the number of states in the PADFA. Lastly, we show empirically that PADFAs improve both the space and time efficiency of pattern searching on real-world datasets.
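To make the pattern searching problem concrete, the following is a minimal sketch of a membership query on a plain, unpacked ADFA. It is not the PADFA construction from the paper. It builds a trie, which is one (non-minimal) ADFA, over the six strings of the Figure 1 dictionary with the end marker `$` omitted. The names `build_trie_adfa` and `contains` and the transition-table layout are illustrative assumptions.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical layout: transitions keyed by (state, character).
Transitions = Dict[Tuple[int, str], int]


def build_trie_adfa(words: Iterable[str]) -> Tuple[Transitions, Set[int]]:
    """Build a trie, which is a (non-minimal) ADFA; state 0 is the start state."""
    delta: Transitions = {}
    accepting: Set[int] = set()
    next_state = 1
    for word in words:
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in delta:
                # Allocate a fresh state for an unseen prefix.
                delta[key] = next_state
                next_state += 1
            state = delta[key]
        accepting.add(state)
    return delta, accepting


def contains(delta: Transitions, accepting: Set[int], pattern: str) -> bool:
    """Follow one transition per character; accept only in an accepting state."""
    state = 0
    for ch in pattern:
        nxt = delta.get((state, ch))
        if nxt is None:
            return False  # no outgoing edge labeled ch
        state = nxt
    return state in accepting


words = ["ab", "abab", "ababa", "bb", "bbab", "bbaba"]
delta, accepting = build_trie_adfa(words)
```

This baseline performs one table lookup per pattern character, so a query on a pattern of length m costs m transitions. The PADFA described in the abstract aims to reduce this cost by storing selected paths as packed strings.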

Paper Structure

This paper contains 7 sections, 5 theorems, 1 figure, 1 algorithm.

Key Result

Theorem

Pattern searching in the PADFA of any ADFA takes $\mathcal{O}(m/\alpha + \log k)$ time.
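The $m/\alpha$ term can be illustrated with a small sketch of word-at-a-time string comparison. This is only an illustration under stated assumptions, not the paper's encoding: it reads $\alpha$ as the number of characters that fit in one machine word (here 8-bit characters in 64-bit words, so `ALPHA = 8`), uses Python integers to stand in for machine words, and does not model the $\log k$ term. The names `pack` and `packed_lcp` are hypothetical.

```python
from typing import List

BITS_PER_CHAR = 8                    # assumption: byte-sized characters
WORD_BITS = 64                       # assumption: 64-bit machine words
ALPHA = WORD_BITS // BITS_PER_CHAR   # characters per word


def pack(text: bytes) -> List[int]:
    """Pack ALPHA characters per word, first character in the top byte."""
    return [
        int.from_bytes(text[i:i + ALPHA].ljust(ALPHA, b"\0"), "big")
        for i in range(0, len(text), ALPHA)
    ]


def packed_lcp(a: List[int], a_len: int, b: List[int], b_len: int) -> int:
    """Longest common prefix length, comparing one packed word per step."""
    limit = min(a_len, b_len)
    for i in range((limit + ALPHA - 1) // ALPHA):
        diff = a[i] ^ b[i]
        if diff:
            # Leading zero bits of the XOR give the matching characters
            # inside this word.
            matched = (WORD_BITS - diff.bit_length()) // BITS_PER_CHAR
            return min(limit, i * ALPHA + matched)
    return limit


lcp = packed_lcp(pack(b"ababababab"), 10, pack(b"ababababba"), 10)
```

Comparing a pattern of length m against a packed path this way touches about m/ALPHA words rather than m individual characters, which is the intuition behind the first term of the bound.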

Figures (1)

  • Figure 1: The trie (left), the minADFA (center), and its SymCPD (right) for the dictionary $\mathcal{S} = \{\mathrm{ab\$}, \mathrm{abab\$}, \mathrm{ababa\$}, \mathrm{bb\$}, \mathrm{bbab\$}, \mathrm{bbaba\$}\}$. In the trie and the minADFA, the vertex labeled $r$ is the start state, and double circles mark accepting states. In the SymCPD, bold and dashed edges represent heavy and light edges, respectively. Each vertex is labeled with two integers, $\pi(r, v)$ and $\pi(v, W)$.

Theorems & Definitions (5)

  • Theorem
  • Theorem
  • Corollary
  • Corollary
  • Corollary