Online Computation of String Net Frequency

Peaker Guo, Seeun William Umboh, Anthony Wirth, Justin Zobel

TL;DR

This work tackles online computation of the net frequency $\phi(S)$ of a string $S$ in a text $T$, solving the SINGLE-NF and ALL-NF problems for streaming texts. The authors develop suffix-tree-based methods, including a new NF characteristic and Weiner-link techniques, that run in optimal time under a constant-size alphabet: $O(m)$ for online SINGLE-NF and $O(n)$ for online ALL-NF, where $m$ is the length of the query string and $n$ is the length of the text. They first give new offline algorithms and then adapt them to the online setting, building on Ukkonen's online suffix tree construction and on Breslauer and Italiano's results on maintaining implicit nodes. The results show that online NF computation can match offline efficiency, enabling fast querying and reporting of strings with positive NF in streaming texts, with potential practical impact in NLP and related string-processing tasks.

Abstract

The net frequency (NF) of a string, of length $m$, in a text, of length $n$, is the number of occurrences of the string in the text with unique left and right extensions. Recently, Guo et al. [CPM 2024] showed that NF is combinatorially interesting and how two key questions can be computed efficiently in the offline setting. First, SINGLE-NF: reporting the NF of a query string in an input text. Second, ALL-NF: reporting an occurrence and the NF of each string of positive NF in an input text. For many applications, however, facilitating these computations in an online manner is highly desirable. We are the first to solve the above two problems in the online setting, and we do so in optimal time, assuming, as is common, a constant-size alphabet: SINGLE-NF in $O(m)$ time and ALL-NF in $O(n)$ time. Our results are achieved by first designing new and simpler offline algorithms using suffix trees, proving additional properties of NF, and exploiting Ukkonen's online suffix tree construction algorithm and results on implicit node maintenance in an implicit suffix tree by Breslauer and Italiano.
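To make the definition in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the paper's suffix-tree algorithm and runs in polynomial rather than optimal time. It rests on two assumptions that the abstract does not spell out: only repeated strings (frequency at least 2) can have positive NF, and an extension past either end of the text is treated as unique. The function names `count_occurrences`, `net_frequency`, and `all_positive_net_frequencies` are hypothetical, chosen for illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    if not pattern:
        return 0
    return sum(
        1
        for i in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, i)
    )


def net_frequency(text: str, s: str) -> int:
    """Brute-force SINGLE-NF: occurrences of s whose one-character left
    and right extensions each occur exactly once in text.

    Assumptions (not stated in the abstract): a string that is not
    repeated has NF 0, and an extension that would run past either end
    of the text counts as unique.
    """
    m = len(s)
    if count_occurrences(text, s) < 2:
        return 0
    nf = 0
    for i in range(len(text) - m + 1):
        if not text.startswith(s, i):
            continue
        left_unique = i == 0 or count_occurrences(text, text[i - 1:i + m]) == 1
        right_unique = (
            i + m == len(text)
            or count_occurrences(text, text[i:i + m + 1]) == 1
        )
        if left_unique and right_unique:
            nf += 1
    return nf


def all_positive_net_frequencies(text: str) -> dict:
    """Brute-force ALL-NF: map each string of positive NF to a pair
    (start index of its leftmost occurrence, NF)."""
    result = {}
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            if s in result:
                continue
            nf = net_frequency(text, s)
            if nf > 0:
                result[s] = (i, nf)
    return result


if __name__ == "__main__":
    print(net_frequency("abcabc", "abc"))          # 2
    print(net_frequency("abcabc", "ab"))           # 0
    print(all_positive_net_frequencies("abcabc"))  # {'abc': (0, 2)}
```

In `abcabc`, both occurrences of `abc` count: each has one extension that falls off the text and another (`abca` or `cabc`) that occurs once. The substring `ab` has NF 0 because its right extension `abc` is repeated. A reference implementation like this is mainly useful for checking a faster implementation on small inputs.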

Paper Structure (15 sections, 16 theorems, 5 equations, 4 figures, 3 algorithms)


Key Result

Lemma

Given a repeated string $S$,

Figures (4)

  • Figure 1: The suffix tree (left) and implicit suffix tree (right) for text aabaabababaa. Leaves (squares) and implicit nodes (red dots) are numbered; green arrows are suffix links coming from branching nodes.
  • Figure 2: for offline all-nf
  • Figure 3: Illustration of Theorem [thm:locus-unique-right-ext]: Case 1 (left), Case 2 (middle), and Case 3 (right). Black dots, coloured dots, and squares represent branching nodes, implicit nodes, and leaves, respectively. Each dashed (non-existent) edge has label $\$$ and leads to a dashed (non-existent) leaf node. In each case, each implicit node has its own colour. The implicit node $(u, d)$ is also labelled, and the leaf nodes corresponding to its unique right extensions share the same colour as $(u, d)$.
  • Figure 4: Let $r$ be the root node of the implicit suffix tree. An edge is shown straight; a path is shown squiggly. In subfigure [fig:wlink-theorem], some possible locations of nodes $u$, $w$, and $v := \operatorname{parent}(u)$ are shown: $v \neq r \neq w$ (left), $v = r = w$ (top right), $v = r \neq u = w$ (bottom right). Each green arrow indicates an implicit Weiner link from $(u, d)$ to $(q, \ell)$. In subfigure [fig:wlink-proof], compare the scenarios when $v^*$ exists (left) and when $v^*$ does not exist (right). A node is coloured grey to indicate that it exists only under a false assumption. Next to several nodes are the corresponding path labels.

Theorems & Definitions (26)

  • Definition: Net occurrence [Guo et al., CPM 2024]
  • Lemma: NF characteristic [Guo et al., CPM 2024]
  • Lemma: [Guo et al., CPM 2024]
  • Definition: Implicit node
  • Definition
  • Lemma: [Breslauer and Italiano, TCS 2012]
  • Theorem: [Breslauer and Italiano, TCS 2012; Ukkonen, Algorithmica 1995]
  • Theorem: Suffix tree NF characteristic
  • Proof
  • Proposition
  • ...and 16 more