Generalized compression and compressive search of large datasets

Morgan E. Prior; Thomas Howard; Emily Light; Najib Ishaq; Noah M. Daniels

TL;DR

panCAKES addresses the need for sub-linear search on massive datasets by enabling exact $k$-NN and $\rho$-NN queries directly on compressed data. It does so with a hierarchical clustering based on the CLAM tree and a mixed unitary/recursive compression scheme that encodes points as differences from cluster centers, so that the memory cost of an encoding is proportional to the distance between the two points involved. The authors show that compression can match gzip on many datasets while still supporting exact search, albeit with some slowdown in query time that depends on the dataset and distance function. The work provides a practical, open-source, general-purpose framework for compressive search on data that obeys the manifold hypothesis, and demonstrates the approach on genomic, proteomic, and set data.
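
To make the idea of "encoding a point as a difference from its cluster center" concrete, here is a minimal sketch. It is not the panCAKES implementation: it assumes equal-length strings and uses Hamming distance as a stand-in for the edit distances named in the paper, and the names `hamming`, `encode`, and `decode` are hypothetical. The point it illustrates is that the number of stored edits equals the distance between the two strings.

```python
from typing import List, Tuple

# Assumption for this sketch: an edit is (position, replacement character).
Edit = Tuple[int, str]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def encode(center: str, point: str) -> List[Edit]:
    """Store `point` as the substitutions that turn `center` into it."""
    if len(center) != len(point):
        raise ValueError("this sketch only handles equal-length strings")
    return [(i, p) for i, (c, p) in enumerate(zip(center, point)) if c != p]


def decode(center: str, edits: List[Edit]) -> str:
    """Rebuild the original point by applying the edits to the center."""
    chars = list(center)
    for i, ch in edits:
        chars[i] = ch
    return "".join(chars)


center = "ACGTACGTAC"
point = "ACGTTCGTAA"
edits = encode(center, point)

# The encoding holds one entry per differing position, so its size
# is exactly the Hamming distance between `center` and `point`.
print(edits, hamming(center, point), decode(center, edits) == point)
```

A point identical to its center encodes to an empty list, which is why tightly packed clusters compress well under this kind of scheme.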

Abstract

The Big Data explosion has necessitated the development of search algorithms that scale sub-linearly in time and memory. While compression algorithms and search algorithms do exist independently, few algorithms offer both, and those which do are domain-specific. We present panCAKES, a novel approach to compressive search, i.e., a way to perform $k$-NN and $\rho$-NN search on compressed data while only decompressing a small, relevant portion of the data. panCAKES assumes the manifold hypothesis and leverages the low-dimensional structure of the data to compress and search it efficiently. panCAKES is generic over any distance function for which the distance between two points is proportional to the memory cost of storing an encoding of one in terms of the other. This property holds for many widely-used distance functions, e.g., string edit distances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures (Jaccard, Dice, etc.). We benchmark panCAKES on a variety of datasets, including genomic, proteomic, and set data. We compare compression ratios to gzip, and search performance between the compressed and uncompressed versions of the same dataset. panCAKES achieves compression ratios close to those of gzip, while offering sub-linear time performance for $k$-NN and $\rho$-NN search. We conclude that panCAKES is an efficient, general-purpose algorithm for exact compressive search on large datasets that obey the manifold hypothesis. We provide an open-source implementation of panCAKES in the Rust programming language.
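
The following sketch illustrates how a $\rho$-NN query might decompress only the relevant part of a compressed dataset. It is a simplification under stated assumptions, not the authors' algorithm: it uses a single flat layer of clusters instead of a CLAM tree, Hamming distance on equal-length strings instead of a general edit distance, and hypothetical names (`Cluster`, `build_cluster`, `rnn_search`). A cluster is skipped without decoding whenever the triangle inequality shows that none of its members can lie within $\rho$ of the query.

```python
from dataclasses import dataclass
from typing import List, Tuple

Edit = Tuple[int, str]  # (position, replacement character); sketch-only format


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def encode(center: str, point: str) -> List[Edit]:
    return [(i, p) for i, (c, p) in enumerate(zip(center, point)) if c != p]


def decode(center: str, edits: List[Edit]) -> str:
    chars = list(center)
    for i, ch in edits:
        chars[i] = ch
    return "".join(chars)


@dataclass
class Cluster:
    center: str                 # stored in full
    radius: int                 # max distance from the center to any member
    members: List[List[Edit]]   # every member stored as edits against center


def build_cluster(center: str, points: List[str]) -> Cluster:
    radius = max((hamming(center, p) for p in points), default=0)
    return Cluster(center, radius, [encode(center, p) for p in points])


def rnn_search(clusters: List[Cluster], query: str, rho: int):
    """Return all points within `rho` of `query` and how many were decoded."""
    hits, decoded = [], 0
    for c in clusters:
        # Triangle inequality: every member p satisfies
        # d(query, p) >= d(query, center) - radius, so skip if that exceeds rho.
        if hamming(query, c.center) > rho + c.radius:
            continue
        for edits in c.members:
            p = decode(c.center, edits)
            decoded += 1
            if hamming(query, p) <= rho:
                hits.append(p)
    return sorted(hits), decoded


clusters = [
    build_cluster("AAAAAAAA", ["AAAAAAAA", "AAAAAAAT", "AAAAAACT"]),
    build_cluster("GGGGGGGG", ["GGGGGGGG", "GGGGGGGC", "GGGGGGCC"]),
]
hits, decoded = rnn_search(clusters, "AAAAAATT", 1)
print(hits, decoded)
```

In this toy run the second cluster is never decoded, because its center is too far from the query for any member to qualify. The search remains exact because pruning relies only on the triangle inequality.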

Paper Structure (21 sections, 9 equations, 4 figures, 4 tables, 2 algorithms)

Introduction
Methods
Building the CLAM Tree
Compression
Search
Datasets And Benchmarking
SILVA 18S
GreenGenes
GreenGenes 12.10
GreenGenes 13.8
PDB-seq
Kosarak
MovieLens-10M
Results
Scaling Behavior of Cluster Radii
...and 6 more sections

Figures (4)

Figure 1: A cluster tree with a mix of unitarily and recursively compressed clusters. A green edge from a parent to a child indicates that the center of the child will be encoded in terms of the center of the parent. A red-shaded cluster indicates that the cluster will be unitarily compressed. A dashed edge between a unitarily compressed cluster and a child indicates that the child will be deleted during Algorithm \ref{alg:methods:compress}. Notably, unitarily compressed clusters do not all occur at the same depth in the tree. The exact depth at which recursive compression becomes more efficient than unitary compression varies with the structure at different regions of the manifold.
Figure 2: An alternate view of the cluster tree from Figure \ref{fig:results:unitary1}. A green dashed edge between points indicates recursive compression. For example, the dashed green edges $\overline{yj}$ and $\overline{yk}$ indicate that $j$ and $k$ were recursively encoded in terms of $y$. A solid red edge between points indicates unitary compression. For example, the red edge $\overline{xi}$ indicates that $i$ is encoded in terms of its cluster center, $x$.
Figure 3: Scaling behavior of radii on a two-dimensional disk of uniformly distributed points, representing the worst-case scenario for a two-dimensional distribution. The root cluster $C_0$ has center $o_0$ and radius $R_0$. After one application of Partition (Algorithm \ref{alg:methods:partition}), we have a child cluster $C_1$, with radius $R_1$ and center $o_1$. The right triangle formed by $o_0$, $o_1$, and $y_+$ in $C_1$ shows that $R_0 < R_1$. Hence, the radius of a child cluster can be larger than that of its parent. However, after another application of Algorithm \ref{alg:methods:partition}, we have consumed both orthogonal axes, as shown in $C_2$. Now, clearly $R_2 < R_0$.
Figure 4: Compression cost breakdown assuming LFD = 3 and stride = 1. The leaf clusters within the stride (i.e., clusters at depth 4), shaded red, are compressed unitarily. Their ancestors, connected by green edges, are compressed recursively. The recursive cost $T_R$ is the sum of the costs of encoding the green edges. The unitary cost $T_U$ is the sum of the costs of encoding all non-center points in the red clusters.
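
The captions for Figures 1 and 4 describe a trade-off between unitary and recursive compression. One way to picture it is the bottom-up cost comparison sketched below. This is an illustration under assumptions and is not taken from the paper's algorithms: the cost of an encoding is taken to be the Hamming distance between equal-length strings, the decision rule (keep whichever option is cheaper, and delete the children of a unitarily compressed cluster) is inferred from the captions, and the names `Node`, `unitary_cost`, and `compress` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Node:
    center: str
    points: List[str]                        # all points in the cluster
    children: List["Node"] = field(default_factory=list)
    unitary: bool = False                    # set by compress()


def unitary_cost(node: Node) -> int:
    """Cost of encoding every point directly against the cluster center."""
    return sum(hamming(node.center, p) for p in node.points)


def compress(node: Node) -> int:
    """Return the cheaper total cost and mark the tree accordingly."""
    u = unitary_cost(node)
    if not node.children:
        node.unitary = True
        return u
    # Recursive cost: encode each child center against this center
    # (the "green edges"), plus the best cost of each child subtree.
    r = sum(hamming(node.center, ch.center) + compress(ch)
            for ch in node.children)
    if u <= r:
        node.unitary = True
        node.children = []   # descendants are no longer needed
        return u
    return r


# Two well-separated groups: recursive compression is cheaper at the root.
spread = Node("AAAA", ["AAAA", "AAAT", "GGGG", "GGGC"], [
    Node("AAAT", ["AAAA", "AAAT"]),
    Node("GGGG", ["GGGG", "GGGC"]),
])
# One tight group: unitary compression is cheaper at the root.
tight = Node("AAAA", ["AAAA", "AAAT", "AATA"], [
    Node("AAAT", ["AAAT", "AAAA"]),
    Node("AATA", ["AATA"]),
])
spread_cost = compress(spread)
tight_cost = compress(tight)
print(spread_cost, spread.unitary, tight_cost, tight.unitary)
```

In the first tree, encoding the far group directly against the root center would cost 4 per point, so it is cheaper to pay once for the edge to the child center and then encode locally. In the second tree every point is already close to the root center, so the extra edges only add cost.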
