Tag arrays

Travis Gagie

TL;DR

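A property has contextual locality if characters with similar contexts tend to have the same or similar values ("tags") of that property. The paper argues that, for a repetitive text and such a property, listing the tags in their characters' BWT order yields a string, the text and property's tag array, that is run-length compressible either directly or after some minor manipulation.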

Abstract

The Burrows-Wheeler Transform (BWT) moves characters with similar contexts in a text together, where a character's context consists of the characters immediately following it. We say that a property has contextual locality if characters with similar contexts tend to have the same or similar values ("tags") of that property. We argue that if we consider a repetitive text and such a property and the tags in their characters' BWT order, then the resulting string, the text and property's tag array, will be run-length compressible either directly or after some minor manipulation.
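
As a minimal sketch of this idea, consider an idealized toy text made of four identical copies of a short string, and take each character's offset within its copy as the property. The function names, the toy text, and the choice of property are assumptions made for illustration and are not taken from the paper; the suffix array is built naively, which is only suitable for small inputs.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; fine for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_and_tag_array(text, tags):
    # BWT[j] is the character preceding the j-th smallest suffix (cyclically);
    # the tag array lists the tags of those same characters, in the same order.
    n = len(text)
    sa = suffix_array(text)
    bwt = "".join(text[(i - 1) % n] for i in sa)
    tag_array = [tags[(i - 1) % n] for i in sa]
    return bwt, tag_array


def count_runs(seq):
    # Number of maximal runs of equal consecutive values.
    return sum(1 for k in range(len(seq)) if k == 0 or seq[k] != seq[k - 1])


# Hypothetical toy input: four identical copies separated by '#', ending in '$'.
text = "GATTACA#" * 3 + "GATTACA$"
# Hypothetical property: each character's offset (column) within its copy.
tags = [i % 8 for i in range(len(text))]

bwt, tag_array = bwt_and_tag_array(text, tags)

print("runs in text order:", count_runs(tags))       # 32
print("runs in BWT order: ", count_runs(tag_array))  # 8
```

In text order the tags cycle through 0 to 7 and never repeat consecutively, so there are as many runs as characters. In BWT order the characters preceding the same context in every copy sit next to each other and share a tag, so the tag array collapses to one run per column.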

Paper Structure

This paper contains 3 figures.

Figures (3)

  • Figure 1: A periodic string (left) and its BWT (right), both written as matrices.
  • Figure 2: A toy alignment (first row, left) and its characters' column numbers in the alignment (first row, right), its LCP values measuring to the ends of the strings (second row, left), its ILCP values (second row, right), its characters' row numbers in the alignment (third row, left), its characters' positions in the concatenation of the strings (third row, right), and the PLCP values for the concatenation of the strings (fourth row). The first three grids of numbers have run-like structure in column-major order and the second three grids of numbers have run-like structure in row-major order.
  • Figure 3: The information from Figure 2 but in BWT order instead of text order. For ease of presentation on one page, we have also rotated the information 90 degrees, so what should be run-like structure in row-major order according to our arguments is run-like structure in column-major order here. The first column between the lines is the characters in the alignment in BWT order (that is, the BWT itself), and the others are the characters' columns in the alignment, the LCP values measuring to the end of the string, the ILCP, the characters' rows in the alignment, the characters' positions in the concatenation of the strings (now permuted into the suffix array), and the PLCP values (now permuted into the LCP array). The columns to the right of the second line are the differentially compressed suffix array and LCP array. The first four columns between the lines are visibly run-length compressible (although this does not scale for the third column), while the fifth and the two columns to the right of the second line display some repetitive structure.
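
The "differentially compressed" suffix array mentioned in the Figure 3 caption can be illustrated with a small sketch, assuming it means storing the difference between each suffix array entry and the previous one. That reading, the function names, and the toy text of four identical copies are assumptions for illustration rather than the paper's construction.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; fine for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def differences(values):
    # Differential encoding: keep each value minus its predecessor.
    return [values[k] - values[k - 1] for k in range(1, len(values))]


# Hypothetical toy input: four identical copies separated by '#', ending in '$'.
text = "GATTACA#" * 3 + "GATTACA$"
period = 8  # length of one copy including its separator

sa = suffix_array(text)
sa_diffs = differences(sa)

# Suffixes starting at the same offset in consecutive copies are adjacent in
# the suffix array, so most differences equal the period.
print(sa_diffs)
print("differences equal to the period:", sa_diffs.count(period), "of", len(sa_diffs))
```

The suffix array itself is a permutation and has no runs of equal values, but after differencing, the entries within each group of aligned suffixes become the same small number, which is the kind of repetitive structure the caption describes.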