
String Covering: A Survey

Neerja Mhaskar, W. F. Smyth

TL;DR

The survey addresses how string covers and seeds can yield compact, interpretable representations of long strings by formalizing quasiperiodicity through covers $u$ of a string $x$ and through seeds, which are covers of superstrings of $x$. It systematically catalogs algorithmic approaches, foundational data structures such as the suffix array and LCP array, and a broad spectrum of cover/seed variants including partial, maximal, frequency, approximate, and 2D forms, as well as extensions to indeterminate and weighted strings. Key results include linear-time algorithms for specific cover/seed problems, NP-hardness of minimum $k$-cover computation, and the development of tools such as the cover suffix tree and package representations to manage seeds. The work highlights open problems and future directions with significant implications for pattern discovery in biology and scalable string processing.

Abstract

The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in future.
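To make the two central definitions concrete, here is a minimal brute-force sketch in Python. It is not one of the algorithms surveyed in the paper; the function names `is_cover` and `is_seed` are hypothetical, and the seed check assumes $|u| \le |x|$ and simply tries every superstring of $x$ that extends it by fewer than $|u|$ letters on each side.

```python
def is_cover(u: str, x: str) -> bool:
    """Return True if every position of x lies inside some occurrence of u."""
    m, n = len(u), len(x)
    if m == 0 or m > n:
        return False
    covered_up_to = 0  # x[:covered_up_to] is covered by occurrences seen so far
    for i in range(n - m + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if x.startswith(u, i):
            covered_up_to = i + m
    return covered_up_to == n


def is_seed(u: str, x: str) -> bool:
    """Return True if u covers some superstring of x (brute force).

    Assumption for this sketch: it suffices to try superstrings of the form
    u[:a] + x + u[m-b:] with 0 <= a, b < m, because any occurrence of u that
    overlaps x can overhang it by at most m - 1 letters on either side.
    """
    m = len(u)
    if m == 0 or m > len(x):
        return False
    return any(
        is_cover(u, u[:a] + x + u[m - b:])
        for a in range(m)
        for b in range(m)
    )
```

For instance, `is_cover("aba", "abaababa")` holds, while `"aba"` is not a cover of `"baababaab"` but is a seed of it, since it covers the superstring `"abaababaaba"`. The sketch runs in polynomial rather than linear time and is meant only to illustrate the definitions.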

Paper Structure

This paper contains 22 sections, 8 equations, and 2 figures.

Figures (2)

  • Figure 1: Suffix array, $\mathcal{LCP}/\mathcal{RSF}/\mathcal{OLP}$ arrays, and the corresponding suffix tree for $\boldsymbol{x} = abaababa$ (adapted from S13).
  • Figure 2: $LS_{min}$, $LS_{max}$, $RS_{min}$, and $RS_{max}$ are the minimal left seed, maximal left seed, minimal right seed, and maximal right seed arrays, respectively, computed for the string $\boldsymbol{x} = abaababaabaabab$ (adapted from LRSeed).