Fast Practical Compression of Deterministic Finite Automata

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen

TL;DR

We present a simple, general framework based on locality-sensitive hashing that constructs an approximation of the optimal D2FA in near-linear time.

Abstract

We revisit the popular delayed deterministic finite automaton (D2FA) compression algorithm introduced by Kumar et al. [SIGCOMM 2006] for compressing deterministic finite automata (DFAs) used in intrusion detection systems. This compression scheme exploits similarities between the sets of outgoing transitions of different states to achieve strong compression while maintaining high matching throughput. Unfortunately, the D2FA algorithm and its later variants require at least quadratic compression time, since they compare all pairs of states to compute an optimal compression. This is too slow, and in some cases infeasible, for the collections of regular expressions in modern intrusion detection systems, which produce DFAs with millions of states. Our main result is a simple, general framework based on locality-sensitive hashing that constructs an approximation of the optimal D2FA in near-linear time. We apply our approach to the original D2FA compression algorithm and two important variants, and we experimentally evaluate our algorithms on DFAs from widely used modern intrusion detection systems. Overall, our new algorithms compress up to an order of magnitude faster than existing solutions with little or no loss in compression. Consequently, our algorithms are significantly more scalable and can handle larger collections of regular expressions than previous solutions.
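To illustrate why hashing can avoid the all-pairs comparison, here is a minimal sketch in Python. It is not the authors' algorithm: it assumes a dense transition table and uses a generic sampling-based locality-sensitive hash, in which two states collide when they agree on a few randomly chosen symbols. The names `similarity`, `lsh_candidate_pairs`, `sample_size`, and `rounds` are hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def similarity(row_a, row_b):
    """Count the symbols on which two states move to the same target."""
    return sum(1 for a, b in zip(row_a, row_b) if a == b)


def lsh_candidate_pairs(table, sample_size=4, rounds=8, seed=0):
    """Return pairs of states that are likely to have similar transitions.

    table[q][c] is the target of state q on symbol c (assumed layout).
    In each round, states are bucketed by their targets on a random
    sample of symbols; states with many shared transitions collide in
    at least one round with high probability.
    """
    rng = random.Random(seed)
    alphabet_size = len(table[0])
    pairs = set()
    for _ in range(rounds):
        symbols = rng.sample(range(alphabet_size), sample_size)
        buckets = defaultdict(list)
        for state, row in enumerate(table):
            key = tuple(row[c] for c in symbols)
            buckets[key].append(state)
        # Only states sharing a bucket are compared later. A practical
        # implementation would cap the work done per bucket.
        for states in buckets.values():
            for i in range(len(states)):
                for j in range(i + 1, len(states)):
                    pairs.add((states[i], states[j]))
    return pairs
```

Only the candidate pairs returned by `lsh_candidate_pairs` would then be scored with `similarity`, instead of every pair of states. Under these assumptions, states with no transitions in common never share a bucket, while states with identical rows always do.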

Paper Structure (27 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Example from Kumar et al. [SIGCOMM 2006]. (A) DFA $D$ for the regular expression .*((ab+c+)|(cd+)|(bd+e)). Edges to $q_0$ are omitted. (B) Space reduction graph for $D$, with edges annotated with similarity. Edges with similarity less than $4$ are omitted, except those incident to $q_2$, which are kept to avoid disconnecting the graph. (C) D2FA equivalent to $D$. All transitions are shown; default transitions are dashed.
  • Figure 2: Results for the Snort dataset on the algorithms for general compression (top), bounded longest delay (middle), and bounded longest matching delay (bottom). On the left, we show compression time in seconds vs. the number of states in the input DFA. On the right, we show the number of transitions in the D2FA as a percent of the number of transitions in the input DFA.
  • Figure 3: Results for the Suricata dataset on the algorithms for general compression (top), bounded longest delay (middle), and bounded longest matching delay (bottom). On the left, we show compression time in seconds vs. the number of states in the input DFA. On the right, we show the number of transitions in the D2FA as a percent of the number of transitions in the input DFA.
  • Figure 4: Results for the Zeek dataset on the algorithms for general compression (top), bounded longest delay (middle), and bounded longest matching delay (bottom). On the left, we show compression time in seconds vs. the number of states in the input DFA. On the right, we show the number of transitions in the D2FA as a percent of the number of transitions in the input DFA.
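
The default transitions of Figure 1 (C) and the transition percentage reported in Figures 2 to 4 can be made concrete with a small Python sketch. It rests on stated assumptions rather than on the paper's implementation: the three-state automaton is a toy example (not the DFA of Figure 1), the helper names `step` and `transition_percent` are hypothetical, and default transitions are counted as transitions of the D2FA, which may differ from the paper's accounting.

```python
def step(labeled, default, state, symbol):
    """Simulate one input symbol on a D2FA.

    Default transitions consume no input, so they are followed until a
    state with a labeled transition on `symbol` is reached. The root
    (default is None) is assumed to keep all of its labeled transitions,
    which guarantees termination.
    """
    while symbol not in labeled[state]:
        state = default[state]
    return labeled[state][symbol]


def transition_percent(labeled, default, dfa):
    """D2FA transitions (labeled plus default) as a percent of DFA transitions."""
    d2fa_count = sum(len(t) for t in labeled.values())
    d2fa_count += sum(1 for target in default.values() if target is not None)
    dfa_count = sum(len(t) for t in dfa.values())
    return 100.0 * d2fa_count / dfa_count


# Toy DFA over the alphabet {a, b}.
dfa = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}

# Equivalent D2FA: state 0 is the root; states 1 and 2 keep only the
# transitions that differ from state 0 and default to it otherwise.
labeled = {
    0: {"a": 1, "b": 0},
    1: {"b": 2},
    2: {},
}
default = {0: None, 1: 0, 2: 0}
```

In this toy case the D2FA stores three labeled and two default transitions against six in the DFA, and `step` returns the same target as the DFA for every state and symbol.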