Dynamic Suffix Array in Optimal Compressed Space

Takaaki Nishimoto, Yasuo Tabei

TL;DR

This work presents the first dynamic compressed data structure that supports SA queries and updates in polylogarithmic time using $\delta$-optimal space: queries take $O(\log^7 n)$ time and updates take expected $O(\log^8 n)$ time.

Abstract

Big data, encompassing extensive datasets, has seen rapid expansion, notably with a considerable portion being textual data, including strings and texts. Simple compression methods and standard data structures prove inadequate for processing these datasets, as they require decompression for usage or consume extensive memory resources. This has motivated the development of compressed data structures that support various queries for a given string, typically operating in polylogarithmic time and utilizing compressed space proportional to the string's length. Notably, the suffix array (SA) query is a critical component in implementing a suffix tree, which has a broad spectrum of applications. A line of research has been conducted on (especially static) compressed data structures that support the SA query. A common finding from most of these studies is the suboptimal space efficiency of existing compressed data structures. Kociumaka, Navarro, and Prezza [IEEE Trans. Inf. Theory 2023] have made a significant contribution by introducing an asymptotically minimal space requirement, $O\left(\delta \log\frac{n \log \sigma}{\delta \log n} \log n\right)$ bits ($\delta$-optimal space), sufficient to represent any string of length $n$, with an alphabet size of $\sigma$, and substring complexity $\delta$, serving as a measure of repetitiveness. More recently, Kempa and Kociumaka [FOCS 2023] presented $\delta$-SA, a compressed data structure supporting SA queries in $\delta$-optimal space. However, the data structures introduced thus far are static. We present the first dynamic compressed data structure that supports the SA query and update in polylogarithmic time and $\delta$-optimal space. More precisely, it can answer SA queries and perform updates in $O(\log^7 n)$ and expected $O(\log^8 n)$ time, respectively, using expected $\delta$-optimal space.
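
To make the two central notions of the abstract concrete, the following is a minimal sketch of a suffix array and of the substring complexity $\delta$, taken here as $\delta = \max_{k \geq 1} d_k(T)/k$, where $d_k(T)$ is the number of distinct length-$k$ substrings of $T$. The function names `suffix_array` and `substring_complexity` are illustrative assumptions, positions are 0-based, and both routines are naive brute-force computations for small inputs. They are not the compressed data structure of the paper.

```python
def suffix_array(text):
    """Naive suffix array: 0-based start positions of all suffixes,
    listed in lexicographic order of the suffixes (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def substring_complexity(text):
    """Brute-force substring complexity: max over k of d_k / k,
    where d_k is the number of distinct substrings of length k."""
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


if __name__ == "__main__":
    t = "banana"
    sa = suffix_array(t)
    # An SA query at index i simply returns sa[i] in this naive setting.
    print(sa)
    print(substring_complexity(t))
```

A highly repetitive string such as `"aaaa"` has a small $\delta$ relative to its length, which is why space bounds expressed in terms of $\delta$ can be far below $n$.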

Paper Structure

This paper contains 475 sections, 326 theorems, 247 equations, 10 figures, 6 tables, 2 algorithms.

Key Result

Theorem 5.1

For any two suffixes $T[\mathsf{SA}[i]..n]$ and $T[\mathsf{SA}[i']..n]$ in the sa-interval of a string $P$ ($|P| \geq 2$) on the suffix array $\mathsf{SA}$ of $T$, there exist nodes $u, u' \in \mathcal{U}$ such that $[\mathsf{SA}[i], \mathsf{SA}[i] + |P|-1] \in \Delta(u)$ and $T[\mathsf{SA}[i']..(\mat
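
The theorem refers to the sa-interval of a string $P$, that is, the contiguous range of suffix array positions whose suffixes have $P$ as a prefix. The following is a minimal sketch of that notion only, assuming 0-based positions, a half-open interval `(b, e)`, and a naively built suffix array. The names `suffix_array` and `sa_interval` are hypothetical, and the sketch does not reflect the nodes $\mathcal{U}$ or the mapping $\Delta$ used in the paper.

```python
def suffix_array(text):
    """Naive suffix array with 0-based positions (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return (b, e) such that sa[b:e] lists exactly the suffixes of
    `text` that start with `pattern`; b == e means no occurrence."""
    m = len(pattern)

    def prefix(j):
        # Length-m prefix of the j-th smallest suffix.
        return text[sa[j]:sa[j] + m]

    # Lower bound: first position whose prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    b = lo

    # Upper bound: first position whose prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return b, lo


if __name__ == "__main__":
    t = "banana"
    sa = suffix_array(t)
    b, e = sa_interval(t, sa, "an")
    print((b, e), sa[b:e])
```

Binary search is valid here because truncating lexicographically sorted suffixes to their first $|P|$ characters keeps them in non-decreasing order.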

Figures (10)

  • Figure 1: An illustration of RSC query $\mathbf{RSC}(s, e)$ (left) and RSS query $\mathbf{RSS}(P, b)$ (right).
  • Figure 2: An illustration of Theorem \ref{theo:intro_sa_query} (left) and Theorem \ref{theo:intro_isa_query} (right).
  • Figure 3: An illustration of the suffix array of string $T = \mathrm{cbb abab cbb abab}$.
  • Figure 4: An illustration of the derivation tree $\mathcal{T}$ of SLP $\mathcal{G}$ with set $\{ u_{1}, u_{2}, \ldots, u_{21} \}$ of nodes. Each rectangle represents the node written in the lower left corner of the rectangle, and the nonterminal in the rectangle depicts the label of the corresponding node.
  • Figure 5: An illustration of RSC query $\mathbf{RSC}(s, e)$ (left) and RSS query $\mathbf{RSS}(P, b)$ (right).
  • ...and 5 more figures

Theorems & Definitions (708)

  • Theorem 5.1
  • Definition 6.1: RSC query
  • Definition 6.2: RSS query
  • Theorem 6.3
  • Theorem 6.4
  • Theorem 7.1
  • Theorem 7.2
  • Theorem 8.1
  • Theorem 9.1
  • Theorem 9.2
  • ...and 698 more