Incongruity-sensitive access to highly compressed strings
Ferdinando Cicalese, Zsuzsanna Lipták, Travis Gagie, Gonzalo Navarro, Nicola Prezza, Cristian Urbina
TL;DR
The paper addresses fast random access to highly compressed strings by introducing incongruity-sensitive access, in which access time depends on how incongruous a position is with its neighbors. It develops space-efficient data structures for run-length SLPs and block trees that achieve $O(\log \ell_q)$ time per character, where $\ell_q$ is the length of the longest repeated substring containing the queried position $q$. It then extends these ideas to parses with constraints on source overlaps, obtaining $O(h_q + \log_w \ell_q)$ time when the parsing is $\alpha$-contracting, where $h_q$ is the number of phrases that must be copied from their sources to reach the queried character. By converting bidirectional parses into such contracting forms at a modest space overhead, the authors obtain $O(b \log_w(n/b))$-space structures that still support fast access, enabling faster queries for relatively incompressible substrings within highly compressed data. These results have implications for efficiently analyzing repetitive data (e.g., genomes) where mutations or rare variants lie in less compressible regions.
Abstract
Random access to highly compressed strings -- represented by straight-line programs or Lempel-Ziv parses, for example -- is a well-studied topic. Random access to such strings in strongly sublogarithmic time is impossible in the worst case, but previous authors have shown how to support faster access to specific characters and their neighbourhoods. In this paper we explore whether, since better compression can impede access, we can support faster access to relatively incompressible substrings of highly compressed strings. We first show how, given a run-length compressed straight-line program (RLSLP) of size $g_{rl}$ or a block tree of size $L$, we can build an $O (g_{rl})$-space or an $O (L)$-space data structure, respectively, that supports access to any character in time logarithmic in the length of the longest repeated substring containing that character. That is, the more incongruous a character is with respect to the characters around it in a certain sense, the faster we can support access to it. We then prove a similar but more powerful and sophisticated result for parsings in which phrases' sources do not overlap much larger phrases, with the query time depending also on the number of phrases we must copy from their sources to obtain the queried character.
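To make the quantity behind "the length of the longest repeated substring containing that character" concrete, the following is a minimal brute-force sketch. It is not the paper's data structure. The function name `longest_repeat_containing`, the 0-based indexing, the choice to allow overlapping occurrences, and the convention of returning 0 for a character that occurs only once are all assumptions made for illustration; the paper's precise definition may differ in such details.

```python
def longest_repeat_containing(s: str, q: int) -> int:
    """Length of the longest substring of s that covers position q (0-based)
    and occurs at least twice in s (occurrences may overlap).

    Returns 0 if even the single character s[q] occurs only once.
    Brute force, intended only to illustrate the definition.
    """
    n = len(s)
    if not 0 <= q < n:
        raise IndexError("q out of range")
    best = 0
    for i in range(q + 1):          # candidate start, at or before q
        for j in range(q, n):       # candidate end, at or after q
            length = j - i + 1
            if length <= best:
                continue            # cannot improve on the current best
            sub = s[i:j + 1]
            first = s.find(sub)
            second = s.find(sub, first + 1)  # a second, possibly overlapping, occurrence
            if second != -1:
                best = length
            else:
                # If s[i..j] is not repeated, no extension s[i..j'] with j' > j is either.
                break
    return best


if __name__ == "__main__":
    text = "abcabcXabcabc"
    for q in (0, 5, 6):
        print(q, text[q], longest_repeat_containing(text, q))
```

In the string `abcabcXabcabc`, the character `X` is not part of any repeated substring, so its value is as small as possible, while every other position lies inside the repeated block `abcabc`. Under the abstract's bound, positions like `X` are exactly the ones that can be accessed fastest.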
