Time-Optimal Construction of String Synchronizing Sets

Jonas Ellert, Tomasz Kociumaka

TL;DR

This work addresses the time-optimal construction of $\tau$-synchronizing sets for strings. After $O\left(\frac{n}{\log_{\sigma} n}\right)$-time preprocessing of a length-$n$ string over an alphabet of size $\sigma$, a $\tau$-synchronizing set can be constructed in $O\left(\frac{n\log\tau}{\tau\log n}\right)$ time for any $\tau\le n$. A simple variant outputs the set as an explicit sorted list in $O\left(\frac{n}{\tau}\right)$ time or as a bitmask in $O\left(\frac{n}{\log n}\right)$ time; the optimal variant produces a compact fully indexable dictionary that supports $\mathsf{select}$ in $O(1)$ time and $\mathsf{rank}$ in $O\left(\log\frac{\log\tau}{\log\log n}\right)$ time. The approach builds on a refined restricted recompression framework and a novel variable-length encoding for sparse integer sequences, which together enable sublinear-time preprocessing and optimal query performance. Since synchronizing sets underlie faster algorithms for data compression, text indexing, and string similarity in the word RAM model, the results are relevant to all of these applications of local consistency in string processing.
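The dictionary interface mentioned above can be illustrated with a minimal sketch. It is not the paper's data structure: it stores the positions in a plain sorted list and answers $\mathsf{rank}$ by binary search, so it only shows what the two queries mean. The class name `SortedListDictionary`, the 0-based `select`, and the "strictly smaller" convention for `rank` are assumptions made for this example.

```python
from bisect import bisect_left


class SortedListDictionary:
    """Toy indexable dictionary over a set of integer positions.

    Hypothetical illustration only: rank uses binary search, which is
    slower than the bounds quoted in the text.
    """

    def __init__(self, positions):
        # Keep the distinct positions in increasing order.
        self.positions = sorted(set(positions))

    def select(self, k):
        """Return the k-th smallest stored position (k is 0-based)."""
        return self.positions[k]

    def rank(self, x):
        """Return the number of stored positions strictly smaller than x."""
        return bisect_left(self.positions, x)

    def successor(self, x):
        """Return the smallest stored position >= x, or None if none exists."""
        k = self.rank(x)
        return self.positions[k] if k < len(self.positions) else None
```

With both queries available, a successor lookup is just `select(rank(x))`, which is the typical way such a dictionary is used to find the next selected position at or after a given text position.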

Abstract

A key principle in string processing is local consistency: using short contexts to handle matching fragments of a string consistently. String synchronizing sets [Kempa, Kociumaka; STOC 2019] are an influential instantiation of this principle. A $τ$-synchronizing set of a length-$n$ string is a set of $O(n/τ)$ positions, chosen via their length-$2τ$ contexts, such that (outside highly periodic regions) at least one position in every length-$τ$ window is selected. Among their applications are faster algorithms for data compression, text indexing, and string similarity in the word RAM model. We show how to preprocess any string $T \in [0..σ)^n$ in $O(n\logσ/\log n)$ time so that, for any $τ\in[1..n]$, a $τ$-synchronizing set of $T$ can be constructed in $O((n\logτ)/(τ\log n))$ time. Both bounds are optimal in the word RAM model with word size $w=Θ(\log n)$. Previously, the construction time was $O(n/τ)$, either after an $O(n)$-time preprocessing [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024], or without preprocessing if $τ<0.2\log_σn$ [Kempa, Kociumaka; STOC 2019]. A simple version of our method outputs the set as a sorted list in $O(n/τ)$ time, or as a bitmask in $O(n/\log n)$ time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in $O(1)$ time and rank queries in $O(\log(\tfrac{\logτ}{\log\log n}))$ time, matching unconditional cell-probe lower bounds for $τ\le n^{1-Ω(1)}$. We achieve this via a new framework for processing sparse integer sequences in a custom variable-length encoding. For rank and select queries, we augment the optimal variant of van Emde Boas trees [Pătraşcu, Thorup; STOC 2006] with a deterministic linear-time construction. The above query-time guarantees hold after preprocessing time proportional to the encoding size (in words).
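To make the definition concrete, here is a toy sketch of the classic minimizer-style selection rule, not the construction from this paper. Position $i$ is selected when the smallest length-$τ$ substring starting in $[i..i+τ]$ starts at $i$ or at $i+τ$, so the decision depends only on the length-$2τ$ context of $i$. The function names, 0-based indexing, and the use of lexicographic order as the substring identifier are assumptions of this example, and the sketch deliberately ignores the special handling of highly periodic regions, so the $O(n/τ)$ size bound does not hold for periodic inputs.

```python
def toy_synchronizing_set(text, tau):
    """Select positions by a minimizer rule (hypothetical, quadratic-time sketch)."""
    n = len(text)
    # Identifier of the length-tau substring starting at j: the substring itself.
    ids = [text[j:j + tau] for j in range(n - tau + 1)]
    selected = []
    for i in range(n - 2 * tau + 1):
        window = ids[i:i + tau + 1]  # identifiers at positions i .. i + tau
        smallest = min(window)
        # Select i if the minimum is attained at either end of the window.
        if smallest == window[0] or smallest == window[-1]:
            selected.append(i)
    return selected


def is_consistent(text, tau, selected):
    """Positions with equal length-2*tau contexts must be treated alike."""
    chosen = set(selected)
    decision = {}
    for i in range(len(text) - 2 * tau + 1):
        context = text[i:i + 2 * tau]
        flag = i in chosen
        if decision.setdefault(context, flag) != flag:
            return False
    return True


def is_dense(text, tau, selected):
    """Every window of tau consecutive candidate positions contains a selected one."""
    chosen = set(selected)
    last_start = len(text) - 3 * tau + 1
    return all(
        any(p in chosen for p in range(i, i + tau))
        for i in range(last_start + 1)
    )


text = "abracadabra_simsalabim"
sync = toy_synchronizing_set(text, 3)
print(sync, is_consistent(text, 3, sync), is_dense(text, 3, sync))
```

On a highly periodic input such as `"a" * 12` with `tau = 2`, every candidate position is selected, which illustrates why the actual definition treats periodic regions separately.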

Paper Structure

This paper contains 13 sections, 14 theorems, 1 equation, and 1 figure.

Key Result

Theorem 1

A string $T\in [0\mathop{.\,.} \sigma)^n$ can be preprocessed in ${\mathcal{O}}(n/\log_\sigma n)$ time so that, given $\tau \le \frac{1}{2}n$, a $\tau$-synchronizing set $\mathsf{Sync}$ of $T$ of size ${|\mathsf{Sync}| < \frac{70n}{\tau}}$ can be constructed in ${\mathcal{O}}(\frac{n}{\tau})$ time.
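As a quick consistency check (a standard counting argument, not taken from the paper's proofs), the preprocessing bound is written here as ${\mathcal{O}}(n/\log_\sigma n)$ and in the abstract as $O(n\logσ/\log n)$. A change of base shows that the two coincide, and that this is the number of $Θ(\log n)$-bit machine words occupied by $T$ in packed form:

$$\frac{n}{\log_\sigma n} \;=\; \frac{n}{\log n / \log \sigma} \;=\; \frac{n \log \sigma}{\log n}, \qquad \frac{n \log \sigma \text{ bits}}{\Theta(\log n) \text{ bits per word}} \;=\; \Theta\!\left(\frac{n \log \sigma}{\log n}\right) \text{ words}.$$

Similarly, assuming for simplicity that $2 \le \tau \le n$ and that the set has exactly $m = n/\tau$ positions, the number of bits needed to describe an arbitrary $m$-element subset of $n$ positions is

$$\log_2 \binom{n}{m} \;=\; \Theta\!\left(m \log \frac{n}{m}\right) \;=\; \Theta\!\left(\frac{n \log \tau}{\tau}\right) \text{ bits} \;=\; \Theta\!\left(\frac{n \log \tau}{\tau \log n}\right) \text{ words}.$$

This matches the $O((n\logτ)/(τ\log n))$ construction time stated in the abstract, whereas the ${\mathcal{O}}(\frac{n}{\tau})$ bound of Theorem 1 corresponds to writing each of the fewer than $\frac{70n}{\tau}$ positions in a machine word of its own.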

Figures (1)

  • Figure 1: Even and odd rounds of restricted recompression.

Theorems & Definitions (16)

  • Theorem 1: Simplified version of \ref{thm:ss-explicit}
  • Theorem 2: Simplified version of \ref{thm:sss_sparse_with_support}
  • Theorem 3
  • Proposition 4: [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024]
  • Remark 5
  • Definition 6
  • Corollary 7: of [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024]
  • Corollary 8
  • Lemma 8
  • Lemma 9
  • ...and 6 more