String Covering: A Survey
Neerja Mhaskar, W. F. Smyth
TL;DR
This survey examines how covers and seeds can yield compact, interpretable representations of long strings. It formalizes quasiperiodicity through covers $u$ of a string $x$ and treats seeds as covers of a superstring of $x$. It catalogs algorithmic approaches, foundational data structures such as the suffix array and LCP array, and a broad range of cover and seed variants (partial, maximal, frequency, approximate, and 2D), together with extensions to indeterminate and weighted strings. Key results include linear-time algorithms for specific cover and seed problems, NP-hardness of minimum $k$-cover computation, and tools such as the cover suffix tree and package representations for managing seeds. The survey also highlights open problems and future directions relevant to pattern discovery in biology and scalable string processing.
Abstract
The study of strings is an important combinatorial field that predates the digital computer. Strings can be very long, up to trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string $x$, initially proposed in a simple form in 1990 but of increasing interest as more sophisticated variants have been discovered. We then consider covering by a seed, that is, a cover of a superstring of $x$. We conclude with many proposals for research directions that could make significant contributions to string processing in the future.
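
To make the two central notions concrete, the following is a minimal brute-force sketch in Python, not one of the linear-time algorithms surveyed. It assumes the usual reading of the definitions: $u$ is a cover of $x$ if every position of $x$ lies inside some occurrence of $u$ in $x$, and $u$ is a seed of $x$ if $u$ covers some superstring of $x$, here with the additional assumption $|u| \le |x|$. The function names `is_cover` and `is_seed` are illustrative only.

```python
def is_cover(u: str, x: str) -> bool:
    """Return True if every position of x lies inside an occurrence of u in x."""
    m, n = len(u), len(x)
    if m == 0 or m > n:
        return False
    covered_upto = 0  # positions x[0:covered_upto] are covered so far
    for i in range(n - m + 1):
        if i > covered_upto:
            return False  # position covered_upto can no longer be covered
        if x.startswith(u, i):
            covered_upto = i + m
    return covered_upto == n


def is_seed(u: str, x: str) -> bool:
    """Return True if u covers some superstring of x (assuming len(u) <= len(x)).

    A position of x may be covered by a full occurrence of u in x, by a
    proper suffix of u matching a prefix of x (left overhang), or by a
    proper prefix of u matching a suffix of x (right overhang).
    """
    m, n = len(u), len(x)
    if m == 0 or m > n:
        return False
    covered = [False] * n
    # Full occurrences of u inside x.
    for i in range(n - m + 1):
        if x.startswith(u, i):
            covered[i:i + m] = [True] * m
    # Overhanging occurrences at either end of x.
    for k in range(1, m):
        if x.startswith(u[m - k:]):
            covered[:k] = [True] * k
        if x.endswith(u[:k]):
            covered[n - k:] = [True] * k
    return all(covered)


if __name__ == "__main__":
    print(is_cover("aba", "abaababaaba"))  # aba occurs at positions 0, 3, 5, 8
    print(is_cover("aba", "baabaab"))      # x does not begin with aba
    print(is_seed("aba", "baabaab"))       # aba covers the superstring abaabaaba
```

In this sketch every cover is also a seed, since $x$ is a superstring of itself, while `"aba"` is a seed but not a cover of `"baabaab"`. Both functions take time proportional to $|u| \cdot |x|$ in the worst case, so they are suited only to checking small examples.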
