Exploiting New Properties of String Net Frequency for Efficient Computation

Peaker Guo, Patrick Eades, Anthony Wirth, Justin Zobel

TL;DR

This work studies net frequency (NF), a measure that identifies significant substrings by discounting occurrences embedded in longer repeats, and shows how to compute it efficiently. It introduces a simplified characterization of NF and combines suffix arrays, components of the Burrows-Wheeler transform (including the LF mapping), LCP values, and a solution to the coloured range listing problem to solve SINGLE-NF in $O(m + \sigma)$ time and ALL-NF in $O(n)$ time, using a data structure with $O(n)$ construction cost. The authors use the characterization to study strings with positive NF in Fibonacci words, and evaluate the algorithms experimentally on real-world texts (the figures report NYT and DNA datasets): for SINGLE-NF the method is around 100 times faster than reasonable baselines. The results are relevant to applications such as text compression and tokenization, where identifying significant substrings is valuable.

Abstract

Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic to solve two key problems related to net frequency. First, \textsc{single-nf}: how to compute the net frequency of a given string of length $m$ in an input text of length $n$ over an alphabet of size $\sigma$. Second, \textsc{all-nf}: given a length-$n$ input text, how to report every string of positive net frequency. Our methods leverage suffix arrays, components of the Burrows-Wheeler transform, and a solution to the coloured range listing problem. We show that, for both problems, our data structure has $O(n)$ construction cost: with this structure, we solve \textsc{single-nf} in $O(m + \sigma)$ time and \textsc{all-nf} in $O(n)$ time. Experimentally, we find our method to be around 100 times faster than reasonable baselines for \textsc{single-nf}. For \textsc{all-nf}, our results show that, even with prior knowledge of the set of strings with positive net frequency, simply confirming that their net frequency is positive takes longer than with our purpose-designed method.
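
To make the \textsc{single-nf} problem concrete, the following is a minimal brute-force sketch in Python. It is not the authors' method. The abstract does not state the simplified characteristic, so the sketch rests on a labeled assumption: an occurrence of a repeated string counts towards its net frequency when both its one-symbol left extension and its one-symbol right extension occur exactly once in the text. The function names `count_occurrences` and `net_frequency`, and the sentinel symbols used to pad the text, are hypothetical choices made for illustration.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def net_frequency(text, query):
    """Brute-force net frequency of query in text.

    ASSUMPTION (not stated in the abstract): an occurrence of a repeated
    string is counted when its left extension and its right extension are
    both unique in the text. Unique or absent strings get net frequency 0.
    """
    # ASSUMPTION: two distinct sentinels that do not appear in text or query,
    # so that every occurrence has a left and a right neighbour.
    padded = "\x00" + text + "\x01"
    if not query or count_occurrences(padded, query) < 2:
        return 0
    nf = 0
    start = padded.find(query)
    while start != -1:
        end = start + len(query)
        left_ext = padded[start - 1:end]   # x + query
        right_ext = padded[start:end + 1]  # query + y
        if (count_occurrences(padded, left_ext) == 1
                and count_occurrences(padded, right_ext) == 1):
            nf += 1
        start = padded.find(query, start + 1)
    return nf
```

On the toy text `abab`, this sketch returns 2 for the query `ab` and 0 for the query `a`, since every occurrence of `a` is embedded in the repeated string `ab`. Each call rescans the text many times, which is the kind of cost the $O(m + \sigma)$ query bound avoids.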

Paper Structure

This paper contains 22 sections, 11 theorems, 3 equations, 8 figures, 4 tables, 5 algorithms.

Key Result

Lemma 1

The sum of irreducible LCP values is at most $O(n \log \delta)$.
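
As an illustration of the quantity bounded in Lemma 1, the sketch below computes the sum of irreducible LCP values for a small text. It assumes the standard definition, under which the LCP value at rank $k$ of the suffix array is irreducible when $k = 0$ or when the Burrows-Wheeler symbols at ranks $k$ and $k-1$ differ; the summary does not state which definition the lemma uses, nor what $\delta$ denotes, so the sketch does not compute $\delta$. The constructions are deliberately naive and the function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the longest common prefix length of suffixes sa[k-1], sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        h = 0
        while a + h < len(text) and b + h < len(text) and text[a + h] == text[b + h]:
            h += 1
        lcp[k] = h
    return lcp


def bwt(text, sa):
    """BWT symbol at rank k: the symbol preceding suffix sa[k], cyclically."""
    return "".join(text[i - 1] for i in sa)


def irreducible_lcp_sum(text):
    """Sum of LCP values at ranks where the BWT symbol changes.

    ASSUMPTION: text ends with a unique sentinel (here '$') that is smaller
    than every other symbol, and rank 0 is treated as irreducible.
    """
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    b = bwt(text, sa)
    return sum(lcp[k] for k in range(len(sa)) if k == 0 or b[k] != b[k - 1])
```

For `banana$` the BWT is `annb$aa` and the LCP array is `[0, 0, 1, 3, 0, 0, 2]`; the values 1 and 2 sit at ranks where the BWT symbol repeats, so only 3 contributes and the sum is 3.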

Figures (8)

  • Figure 1: Illustration of proof of \ref{thm:f-i-2-net-occ}. Two factorisations of $F_8$ are depicted with rectangles.
  • Figure 2: Illustration of \ref{thm:q-i} with $F_8$. Note that $F_{i-5} = \texttt{ab}$ and $F_{i-4} = \texttt{aba}$.
  • Figure 3: Net frequency distribution on input texts of two different lengths drawn from NYT corpus. For each string length the column shows the percentage of strings with positive NF; strings of NF larger than $4$ are so rare that they are not visible in this plot.
  • Figure 4: Average single-nf query time (in microseconds) of ASA and CRL against query string frequency (left) and length (right) on the NYT dataset. Note that the $y$-axis on the right is scaled logarithmically.
  • Figure 5: Average single-nf query time (in microseconds) of ASA and CRL against query string frequency (left) and length (right) on the DNA dataset. Note that the $y$-axis on the right is scaled logarithmically.
  • ...and 3 more figures
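
The Fibonacci words $F_i$ referenced in Figures 1 and 2 can be generated with a short sketch. The indexing convention $F_1 = \texttt{b}$, $F_2 = \texttt{a}$, $F_i = F_{i-1} F_{i-2}$ is an assumption inferred from the Figure 2 caption, where $i = 8$ gives $F_3 = \texttt{ab}$ and $F_4 = \texttt{aba}$; the function name `fibonacci_word` is hypothetical.

```python
def fibonacci_word(i):
    """Return the i-th Fibonacci word.

    ASSUMPTION: F_1 = "b", F_2 = "a", and F_i = F_{i-1} + F_{i-2} for i >= 3,
    a convention inferred from the Figure 2 caption.
    """
    if i < 1:
        raise ValueError("index must be at least 1")
    prev, curr = "b", "a"  # F_1 and F_2
    if i == 1:
        return prev
    for _ in range(i - 2):
        # Advance one step: the new word is the current word followed by the previous one.
        prev, curr = curr, curr + prev
    return curr
```

Under this convention `fibonacci_word(8)` has length 21, and `fibonacci_word(3)` and `fibonacci_word(4)` reproduce the strings `ab` and `aba` named in the caption.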

Theorems & Definitions (22)

  • Lemma 1 [conf/focs/2020/kempa]
  • Definition 2: Extensions
  • Definition 3: Net frequency [journal/jise/2001/lin]
  • Theorem 4: Net frequency characteristic
  • Lemma 5
  • Definition 6: Net occurrence
  • Theorem 8
  • Definition 9: $Q_i$ and $\Delta(j)$
  • Lemma 10
  • Theorem 11
  • ...and 12 more