Incongruity-sensitive access to highly compressed strings

Ferdinando Cicalese, Zsuzsanna Lipták, Travis Gagie, Gonzalo Navarro, Nicola Prezza, Cristian Urbina

TL;DR

The paper addresses fast random access to highly compressed strings by introducing incongruity-sensitive access, where access time depends on how incongruous a position is with its neighbors. It develops space-efficient data structures for run-length SLPs and block trees that access any character $S[q]$ in $O(\log \ell_q)$ time, where $\ell_q$ is the length of the longest repeated substring containing position $q$. It then extends these ideas to parses with constraints on source overlaps, obtaining $O(h_q + \log_w \ell_q)$ time when the parsing is $\alpha$-contracting, where $h_q$ is the number of phrases that must be copied from their sources to obtain the queried character. By converting bidirectional parses into such contracting forms at a modest space overhead, the authors obtain $O(b \log_w(n/b))$-space structures that still support fast access, enabling faster queries for relatively incompressible substrings within highly compressed data. These results have implications for efficiently analyzing repetitive data (e.g., genomes) where mutations or rare variants lie in less compressible regions.

Abstract

Random access to highly compressed strings -- represented by straight-line programs or Lempel-Ziv parses, for example -- is a well-studied topic. Random access to such strings in strongly sublogarithmic time is impossible in the worst case, but previous authors have shown how to support faster access to specific characters and their neighbourhoods. In this paper we explore whether, since better compression can impede access, we can support faster access to relatively incompressible substrings of highly compressed strings. We first show how, given a run-length compressed straight-line program (RLSLP) of size $g_{rl}$ or a block tree of size $L$, we can build an $O (g_{rl})$-space or an $O (L)$-space data structure, respectively, that supports access to any character in time logarithmic in the length of the longest repeated substring containing that character. That is, the more incongruous a character is with respect to the characters around it in a certain sense, the faster we can support access to it. We then prove a similar but more powerful and sophisticated result for parsings in which phrases' sources do not overlap much larger phrases, with the query time depending also on the number of phrases we must copy from their sources to obtain the queried character.
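To make the quantity behind "incongruity" concrete, here is a minimal brute-force sketch of $\ell_q$, read literally from the abstract as the length of the longest substring that covers position $q$ and occurs at least twice in $S$. The function name `longest_repeat_containing`, the 0-based indexing, and the choice to allow overlapping occurrences are assumptions made for illustration; the paper's formal definition may differ in such details, and this is not how the paper computes anything.

```python
def longest_repeat_containing(s: str, q: int) -> int:
    """Length of the longest substring s[i..j] with i <= q <= j (0-based)
    that occurs at least twice in s; 0 if s[q] itself is unique.
    Overlapping occurrences are counted (an assumption of this sketch)."""
    n = len(s)
    best = 0
    for i in range(q + 1):
        for j in range(q, n):
            sub = s[i:j + 1]
            first = s.find(sub)
            # A second occurrence must start strictly after the first one.
            if s.find(sub, first + 1) == -1:
                # Any longer extension contains sub, so it is not repeated either.
                break
            best = max(best, j - i + 1)
    return best


# The example string from the caption of Figure 1.
S = "abracad" + "abra" * 7 + "cabra"

# The single 'd' (0-based position 6) is not part of any repeated substring.
print(longest_repeat_containing(S, 6))
```

Under this reading, a character such as the lone `d` has $\ell_q = 0$ and is maximally incongruous, while positions deep inside the run $(abra)^7$ have large $\ell_q$.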

Paper Structure (15 sections, 18 theorems, 10 equations, 3 figures)

Key Result

Lemma 1

Let a locally balanced RLSLP of size $g_{rl}$ generate $S[1\mathinner{.\,.} n]$. Then there exists a data structure of size $O(g_{rl})$ that can access any $S[q]$ in time $O(\log \ell_q)$.
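For orientation, the following sketch shows the plain baseline that Lemma 1 improves on: random access in an RLSLP by descending from the root using expansion lengths. It takes time proportional to the grammar's height, not $O(\log \ell_q)$, and it is not the data structure of the lemma. The rule encoding (`"char"`, `"pair"`, `"run"` tuples), the function names, the 0-based positions, and the hand-built grammar below are all assumptions made for illustration; the grammar generates the string of Figure 1 but is not claimed to be the RLSLP drawn there.

```python
def expansion_lengths(rules):
    """rules[i] is ("char", c), ("pair", left, right) or ("run", child, count);
    children always refer to earlier indices. Returns each expansion length."""
    lengths = []
    for rule in rules:
        if rule[0] == "char":
            lengths.append(1)
        elif rule[0] == "pair":
            lengths.append(lengths[rule[1]] + lengths[rule[2]])
        else:  # "run": count copies of the child
            lengths.append(lengths[rule[1]] * rule[2])
    return lengths


def access(rules, lengths, q):
    """Return character q (0-based) of the expansion of the last rule."""
    v = len(rules) - 1
    while True:
        rule = rules[v]
        if rule[0] == "char":
            return rule[1]
        if rule[0] == "pair":
            left, right = rule[1], rule[2]
            if q < lengths[left]:
                v = left
            else:
                q -= lengths[left]
                v = right
        else:
            # All copies in a run are identical, so reduce q modulo one copy.
            v = rule[1]
            q %= lengths[v]


# Hypothetical hand-built RLSLP for S = abracad(abra)^7cabra.
RULES = [
    ("char", "a"), ("char", "b"), ("char", "r"), ("char", "c"), ("char", "d"),
    ("pair", 0, 1),    # 5: ab
    ("pair", 2, 0),    # 6: ra
    ("pair", 5, 6),    # 7: abra
    ("pair", 3, 0),    # 8: ca
    ("pair", 8, 4),    # 9: cad
    ("pair", 7, 9),    # 10: abracad
    ("run", 7, 7),     # 11: (abra)^7
    ("pair", 3, 7),    # 12: cabra
    ("pair", 10, 11),  # 13: abracad(abra)^7
    ("pair", 13, 12),  # 14: the whole string
]
LENGTHS = expansion_lengths(RULES)

decoded = "".join(access(RULES, LENGTHS, q) for q in range(LENGTHS[-1]))
print(decoded == "abracad" + "abra" * 7 + "cabra")
```

The run rule is what distinguishes an RLSLP from an ordinary SLP: a single modulo step skips over all repetitions of the child.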

Figures (3)

  • Figure 1: Heavy forest of an RLSLP generating the string $S=abracad(abra)^7cabra$ (we omit trees containing only one node). The edges are labeled with the left and right labels corresponding to each non-root variable. We use bold circles to highlight root nodes.
  • Figure 2: Example of the compacted trie $Z$ with word size $w = 7$ and $H = 4\ge\log_72^7\approx2.48$ for the set $X=\{9,497,508,527,531,844,1379,1381,1382,1385,1410,1871,2040,2276\}$. In the leaf nodes there are the values of $X$. The concatenation of the labels from the root to a leaf node is the representation in base $w$ of the value in that node.
  • Figure 3: Subtrees of the trie $T$ of Lemma \ref{lem:subintervals2} with roots $v_1,\dots, v_{d'}$ at depth $\log_d \frac{n}{x_r + x_{r + 1} - x_l - x_{l + 1}}$, containing the $l$th through $r$th leaves $p_l, \dots, p_r$ of $T$ (as explained in the proof of Lemma \ref{lem:leaves2}).

Theorems & Definitions (40)

  • Lemma 1
  • proof
  • Corollary 2
  • Corollary 3
  • proof
  • Theorem 4
  • Corollary 5
  • Definition 1
  • Theorem 6
  • Lemma 7
  • ...and 30 more