String Indexing with Compressed Patterns

Philip Bille, Inge Li Gørtz, Teresa Anna Steiner

TL;DR

This work tackles string indexing when the query pattern is given in compressed form, focusing on patterns compressed by LZ77. It introduces a progression of data-structure techniques culminating in a linear-space solution that answers queries in $O(z+\log m+\mathrm{occ})$ time for LZ77-compressed patterns (and extends to LZ78), with near-optimal bounds and practical preprocessing strategies. Central ideas include the phrase trie, a LIS-like LCP decomposition for LZ77, and a slice-tree approach using Karp–Rabin fingerprints and ART decomposition to reduce search depth to $O(\log m)$. Together, these components enable efficient, scalable pattern search on long indexed strings in compressed form, with potential applicability to other compression schemes and fully compressed scenarios. The results have practical significance for client-server settings where queries are transmitted in compressed form, enabling direct processing without decompression and leveraging pattern repetitiveness for speed and space efficiency.

Abstract

Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.

Paper Structure

This paper contains 43 sections, 8 theorems, 7 equations, 5 figures.

Key Result

Theorem 1

We can solve the string indexing with compressed pattern problem for LZ77-compressed patterns in $O(n)$ space and $O(z+\log m + \mathrm{occ})$ time, where $n$ is the length of the indexing string, $m$ is the length of the pattern, and $z$ is the number of phrases in the LZ77 compressed pattern.
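To make the parameters $m$ and $z$ concrete, the following is a minimal sketch of a greedy LZ77 parser and decoder. It assumes a variant in which a phrase is either a single fresh character or a pair (distance, length) that copies the longest match starting earlier, with self-overlapping copies allowed; the exact convention in the paper may differ in details. The function names `lz77_parse` and `lz77_decode` are illustrative, and the parser is deliberately naive (quadratic or worse), not the linear-space machinery of the theorem.

```python
def lz77_parse(s):
    """Greedy LZ77 parse of s (illustrative, naive implementation).

    Each phrase is either a single character (first occurrence) or a
    pair (distance, length) copying from `distance` positions back.
    Copies may overlap the phrase being produced.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_dist = 0, 0
        for j in range(i):  # every earlier starting position is a candidate source
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:  # strict '>' keeps the leftmost source on ties
                best_len, best_dist = l, i - j
        if best_len == 0:
            phrases.append(s[i])  # character not seen before
            i += 1
        else:
            phrases.append((best_dist, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Expand a phrase list produced by lz77_parse back into a string."""
    out = []
    for ph in phrases:
        if isinstance(ph, str):
            out.append(ph)
        else:
            dist, length = ph
            for _ in range(length):
                out.append(out[-dist])  # char-by-char copy handles overlap
    return "".join(out)


pattern = "ABABA$"
parse = lz77_parse(pattern)
print(parse, "m =", len(pattern), "z =", len(parse))
```

For the suffix ABABA$ used in Figure 1, this sketch yields the four phrases A, B, (2,3), $, so $m = 6$ and $z = 4$. On highly repetitive patterns $z$ can be far smaller than $m$, which is where a query time depending on $z$ rather than $m$ pays off.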

Figures (5)

  • Figure 1: The phrase trie for the string ABABACABABA$. In this example, the leaves are sorted according to the lexicographic order of the original suffixes. For instance the $6^{th}$ suffix ABABA$ has the LZ77 parse A B (2,3) $, and this string corresponds to the concatenation of labels on the path from the root to the second leaf.
  • Figure 2: The $k^{th}$ phrase in $S'$ is copied from position $p_k-r'_k$, at which point $S$ and $S'$ are identical; the lcp value gives how far $p_k$ and $p_k-r'_k$ match in $S$.
  • Figure 3: The phrase trie for the string ABABACABABA using linear space.
  • Figure 4: Matching in the slice tree: First, we find the lowest $i$ such that the fingerprint of a prefix of $P$ is present at level $2^i$. Then we go to the corresponding slice tree and binary search for fingerprints within the top tree.
  • Figure 5: The prefix of length $2^{i+1}$ can be constructed from the prefix of length $2^i$ and substrings of $\rho$ resp. $\rho\rho$.
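The fingerprint lookups mentioned in Figure 4 build on Karp–Rabin fingerprints. The following is a generic sketch of the standard technique, not the paper's slice tree: it shows how prefix fingerprints let one obtain the fingerprint of any substring, or of a concatenation, in constant time. The modulus `P`, the fixed base `B`, and all function names are assumptions made for illustration; in practice the base is usually drawn at random to bound the collision probability.

```python
P = (1 << 61) - 1  # a Mersenne prime, assumed here as the modulus
B = 256            # assumed fixed base; normally chosen at random in [1, P)


def prefix_fingerprints(s):
    """Return (fp, pw) where fp[k] is the fingerprint of s[:k] and pw[k] = B**k mod P."""
    fp, pw = [0], [1]
    for ch in s:
        fp.append((fp[-1] * B + ord(ch)) % P)
        pw.append((pw[-1] * B) % P)
    return fp, pw


def substring_fp(fp, pw, i, j):
    """Fingerprint of s[i:j], computed in O(1) from the prefix tables."""
    return (fp[j] - fp[i] * pw[j - i]) % P


def concat_fp(fx, fy, len_y, pw):
    """Fingerprint of XY given the fingerprints of X and Y and the length of Y."""
    return (fx * pw[len_y] + fy) % P


s = "ABABACABABA$"
fp, pw = prefix_fingerprints(s)

# Fingerprints of prefixes whose lengths are powers of two: 1, 2, 4, 8.
levels = {1 << i: fp[1 << i] for i in range(4)}

# The two occurrences of ABABA (positions 0..4 and 6..10) share a fingerprint.
same = substring_fp(fp, pw, 0, 5) == substring_fp(fp, pw, 6, 11)
print(sorted(levels), same)
```

Equal substrings always have equal fingerprints; distinct substrings may collide, but only with small probability when the base is random. That one-sided guarantee is what allows a search to compare fingerprints instead of characters.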

Theorems & Definitions (9)

  • Theorem 1
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • Lemma 4.1
  • Lemma 5.1
  • Lemma 5.2 (Alstrup et al.)
  • Lemma 5.3
  • Theorem 2