Table of Contents
Fetching ...

Pattern Discovery in Colored Strings

Zsuzsanna Lipták, Simon J. Puglisi, Massimiliano Rossi

TL;DR

This work introduces colored strings to model embedded-system traces and studies pattern discovery where substrings are guaranteed to be followed by the same color after a fixed delay. It develops suffix-tree–based algorithms to enumerate minimally $(y,d)$-unique substrings for a fixed color and for all colors, with a baseline approach running in $O(n^2)$ time and a skipping variant achieving $O(n^2\log n)$ in the worst case. The methods are extended to all colors and augmented with output-restriction variants (real-type outputs) and a fast-h optimization, yielding practical speedups on simulated data and competitive results on real RTL traces. Experimental results confirm the approaches' viability for assertion mining in embedded systems, and the code is made available for integration into verification pipelines.

Abstract

In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by $\mathcal{O}(n^2)$, where $n$ is the length of the string. We introduce a baseline algorithm, running in $\mathcal{O}(n^2)$ time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in $\mathcal{O}(n^2\log n)$ time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Pattern Discovery in Colored Strings

TL;DR

This work introduces colored strings to model embedded-system traces and studies pattern discovery where substrings are guaranteed to be followed by the same color after a fixed delay. It develops suffix-tree–based algorithms to enumerate minimally -unique substrings for a fixed color and for all colors, with a baseline approach running in time and a skipping variant achieving in the worst case. The methods are extended to all colors and augmented with output-restriction variants (real-type outputs) and a fast-h optimization, yielding practical speedups on simulated data and competitive results on real RTL traces. Experimental results confirm the approaches' viability for assertion mining in embedded systems, and the code is made available for integration into verification pipelines.

Abstract

In this paper, we consider the problem of identifying patterns of interest in colored strings. A colored string is a string where each position is assigned one of a finite set of colors. Our task is to find substrings of the colored string that always occur followed by the same color at the same distance. The problem is motivated by applications in embedded systems verification, in particular, assertion mining. The goal there is to automatically find properties of the embedded system from the analysis of its simulation traces. We show that, in our setting, the number of patterns of interest is upper-bounded by , where is the length of the string. We introduce a baseline algorithm, running in time, which identifies all patterns of interest satisfying certain minimality conditions, for all colors in the string. For the case where one is interested in patterns related to one color only, we also provide a second algorithm which runs in time in the worst case but is faster than the baseline algorithm in practice. Both solutions use suffix trees, and the second algorithm also uses an appropriately defined priority queue, which allows us to reduce the number of computations. We performed an experimental evaluation of the proposed approaches over both synthetic and real-world datasets, and found that the second algorithm outperforms the first algorithm on all simulated data, while on the real-world data, the performance varies between a slight slowdown (on half of the datasets) and a speedup by a factor of up to 11.

Paper Structure

This paper contains 23 sections, 6 theorems, 2 equations, 5 figures, 2 tables, 6 algorithms.

Key Result

Lemma 1

Given string $S$ of length $n$, the number of minimally $(y,d)$-unique substrings of $S$, over all $y\in \Gamma$ and $d=0,\ldots,n$, is $\mathcal{O}(n^2)$.

Figures (5)

  • Figure 1: Example of a simulation trace of a device with input ports ${\cal I} = \{i_1, i_2, i_3\}$, and output ports ${\cal O} = \{o_1, o_2\}$. The mapping of the input and output values of the trace into the input and output alphabet respectively. The colored string associated to the simulation trace, after the mapping. The solid and dashed shaded and non-shaded boxed values in the simulation trace highlight that every time we see the sequence of input values, then we have the corresponding output value. The solid non-shaded boxed characters in the colored string are the mapping of the corresponding solid non-shaded boxed values in the simulation trace.
  • Figure 2: The suffix tree ${\cal T}$ of the reverse string $S^{\textrm{rev}} = {\tt bacabcacaca}$, where $S = {\tt acacacbacab}$, see Example \ref{['ex:ex1']}. For clarity, the edges carry the label itself rather than a pair of pointers into the string. Suffix links are drawn as dotted directed edges.
  • Figure 3: Top (\ref{['fig:h-function']}): The suffix tree of ${\cal T}$ of the reverse string $S^{\textrm{rev}} = {\tt bacabcacaca}$, reporting the values of $IPQ$, $left\_minimal[0]$, and $left\_minimal[1]$ for each node, after $48$ iterations of Algorithm \ref{['algo:lazy']}. In red, in the upper left of each node, we report the reverse index BFS of the node, below each leaf we report the associated suffix number, on the right of the node we report the values of $IPQ$, $left\_minimal[0]$, and $left\_minimal[1]$. Bottom: The indexed priority queue $IPQ$ after $48$ (\ref{['fig:impq-1']}) and $49$ (\ref{['fig:impq-2']}) iterations of Algorithm \ref{['algo:lazy']}. In the nodes of the priority queue we have the index of the nodes of the suffix tree ${\cal T}(S^\textrm{rev})$ numbered in the reverse index BFS. Below each node, in red and in brackets, the value of the key associated to each index.
  • Figure 4: Results of the execution of algorithms base (color blue), skip (color orange), real (color green), and base-all (color red) over the randomly generated data for $N = 10^3$, $10^4$, and $10^5$. The $x$ axis represents the values of $\sigma = \{2,8,32\}$, and the different markers represents the values of $\gamma = \{2,8,32\}$ (circles, crosses, and boxes, respectively). The three plots in the first row report execution times. The plots in the second row report speedups of skip, real, and base-all with respect to algorithm base represented as the dashed blue line at constant $1.0$.
  • Figure 5: Results of the execution of algorithms base, skip, real, and base-all over the real-world dataset. The plot in Figure \ref{['fig:real:time']} reports execution times. The plot in Figure \ref{['fig:real:speedup']} reports speedups of skip, real, and base-all with respect to algorithm base represented as the dashed blue line at constant $1.0$.

Theorems & Definitions (21)

  • Example 1
  • Definition 1: $y$-good, $y$-unique, minimal
  • Lemma 1
  • Proof 1
  • Lemma 2
  • Proof 2
  • Corollary 1
  • Example 2
  • Definition 2: Left-minimal nodes, left-minimal labels
  • Example 3
  • ...and 11 more