Fast and Small Subsampled R-indexes

Dustin Cobas; Travis Gagie; Gonzalo Navarro

TL;DR

The $sr$-index is a variant of the $r$-index that limits a large fraction of its space to $O(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying the time per reported occurrence by $s$.

Abstract

The $r$-index represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude in query time. Its space usage, $O(r)$ where $r$ is the number of runs in the Burrows--Wheeler Transform of the text, is however higher than that of Lempel--Ziv (LZ) and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. We introduce the $sr$-index, a variant that limits the space to $O(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the expense of multiplying the time per reported occurrence by $s$. The $sr$-index is obtained by subsampling the text positions indexed by the $r$-index, while still supporting pattern matching with guaranteed performance. Our experiments show that the theoretical analysis falls short of describing the practical advantages of the $sr$-index, because it performs much better on real texts than on synthetic ones: the $sr$-index retains the performance of the $r$-index while using 1.5--4.0 times less space, sharply outperforming virtually every other compressed index on repetitive texts in both time and space. Only a particular LZ-based index uses less space than the $sr$-index, but it is an order of magnitude slower. Our second contribution is a pair of indexes, the $r$-csa and the $sr$-csa. Just as the $r$-index adapts the well-known FM-Index to repetitive texts, the $r$-csa adapts Sadakane's Compressed Suffix Array (CSA) to this case. We show that the principles used in the $r$-index fit naturally and efficiently into the CSA framework. The $sr$-csa is the corresponding subsampled version of the $r$-csa. While the CSA performs better than the FM-Index on classic texts with alphabets larger than DNA, we show that the $sr$-csa outperforms the $sr$-index on repetitive texts over those larger alphabets, and on some DNA texts as well.
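
To make the quantities in the space bound concrete, the following is a minimal Python sketch that computes $r$ (the number of runs in the Burrows--Wheeler Transform) for a small text and evaluates the leading term $\min(r, n/s)$. The function names `bwt`, `count_runs`, and `space_bound` are hypothetical, the sorted-rotations construction is a naive stand-in rather than how any of the indexes are built, and the bound ignores the constants hidden by the $O(\cdot)$ notation.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform via sorted rotations.

    Illustrative only: quadratic space, suitable for tiny inputs.
    Assumes the sentinel does not occur in the text and sorts
    before every other character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the quantity r)."""
    return sum(1 for _ in groupby(s))


def space_bound(r: int, n: int, s: int) -> int:
    """Leading term min(r, n/s) of the bound, constants ignored."""
    return min(r, n // s)


if __name__ == "__main__":
    text = "ab" * 50                 # a highly repetitive toy text
    transformed = bwt(text)
    n = len(transformed)             # text length including the sentinel
    r = count_runs(transformed)
    for s in (1, 4, 16, 64):
        print(f"s={s:>2}  n={n}  r={r}  min(r, n/s)={space_bound(r, n, s)}")
```

On highly repetitive inputs $r$ is already far smaller than $n$, so the $n/s$ term only takes over once $s$ is large; on mildly repetitive inputs $r$ approaches $n$, and the $n/s$ term caps the space much earlier.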

Paper Structure