IDentity with Locality: An ideal hash for gene sequence search

Aditya Desai, Gaurav Gupta, Tianyi Zhang, Anshumali Shrivastava

TL;DR

This work proposes a novel hash function family called Identity with Locality (IDL), which co-locates keys that are close in input space without causing them to collide, ensuring both cache locality and key preservation.

Abstract

Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). State-of-the-art systems such as the Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BFs with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems, testing whether each successive gene substring (called a kmer) of a query Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and computational advantages of BFs, are also detrimental to the system performance of gene-search systems. While successive kmers being queried are likely to be very similar, RH, oblivious to any similarity, distributes the kmers uniformly across different parts of a potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function family called Identity with Locality (IDL), which co-locates keys that are close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing RH with IDL functions reduces cache misses by a factor of 5, thus improving the query and indexing times of SOTA methods such as COBS and RAMBO by factors of up to 2 without compromising their quality. We also provide a theoretical analysis of the false positive rate of BFs with IDL functions. This work is the first study that bridges Locality Sensitive Hashing (LSH) and RH to obtain cache efficiency.
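
The membership-test recipe described in the abstract can be illustrated with a minimal Python sketch. It is not the COBS or RAMBO implementation: the class `BloomFilter`, the helpers `kmers`, `index_genome`, and `query`, the default sizes, and the use of salted BLAKE2b as the random hash are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Slide a window of length k over seq and return every kmer in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Toy Bloom filter using salted BLAKE2b as a stand-in random hash (RH)."""

    def __init__(self, m, num_hashes):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, key):
        # Each seed gives an independent-looking hash; positions are
        # spread uniformly over [0, m), regardless of key similarity.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))


def index_genome(genome, k, m=1 << 16, num_hashes=3):
    """Insert every kmer of a genome string into a fresh Bloom filter."""
    bf = BloomFilter(m, num_hashes)
    for km in kmers(genome, k):
        bf.add(km)
    return bf


def query(bf, q, k):
    """Report a match only if every kmer of the query passes the BF test."""
    return all(km in bf for km in kmers(q, k))
```

Because successive kmers overlap in all but one character, yet `_positions` scatters them across the whole bit array, each membership test in `query` is likely to touch a different cache line. That is the access pattern the abstract identifies as the bottleneck.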

Paper Structure (32 sections, 4 theorems, 28 equations, 8 figures, 4 tables, 3 algorithms)

Key Result

Theorem 1

(general IDL construction) Let $\phi$ be drawn from a $(r_1, r_2, p_1, p_2)$-sensitive LSH family, say $\mathcal{L}$, and let $\rho_1, \rho_2$ be drawn from random hash families, say $\mathcal{R}_1 : V \rightarrow [m]$ and $\mathcal{R}_2: U \rightarrow [L]$, respectively. Then the family of hash functions defined from $\phi$, $\rho_1$, and $\rho_2$ is a $(r_1, r_2, \frac{L-1}{L}p_1, \frac{L}{m} + p_2)$-sensitive and $L$-preserving IDL family.
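
The theorem statement above does not show the defining formula, so the following Python sketch rests on an assumption: that the three functions are combined additively as $\psi(x) = (\rho_1(\phi(x)) + \rho_2(x)) \bmod m$, which is consistent with the stated domains $\mathcal{R}_1 : V \rightarrow [m]$ and $\mathcal{R}_2 : U \rightarrow [L]$. Under that assumption, $\rho_1(\phi(x))$ picks a base location shared by similar keys and $\rho_2(x)$ spreads them over a window of width $L$. All function names are hypothetical, salted BLAKE2b stands in for the random hash families, and min-hash over sub-kmers stands in for the LSH $\phi$.

```python
import hashlib


def _h(data, seed):
    """Salted 64-bit hash used as a stand-in for a random hash function."""
    digest = hashlib.blake2b(
        data.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def phi(kmer, t, seed=0):
    """LSH stand-in: min-hash over the set of length-t sub-kmers.

    Returns the sub-kmer with the smallest hash, so two kmers agree on
    phi with probability equal to the Jaccard similarity of their sets.
    """
    subs = {kmer[i:i + t] for i in range(len(kmer) - t + 1)}
    # Pair each hash with its string so ties break deterministically.
    return min((_h(s, seed), s) for s in subs)[1]


def rho1(v, m):
    """Assumed random hash from LSH outputs V into [m] (base location)."""
    return _h(v, 1) % m


def rho2(x, L):
    """Assumed random hash from keys U into [L] (offset inside the window)."""
    return _h(x, 2) % L


def idl_hash(kmer, m, L, t):
    """Assumed IDL form: base location from the LSH bucket plus a key offset."""
    return (rho1(phi(kmer, t), m) + rho2(kmer, L)) % m
```

In this sketch, two kmers with the same `phi` value land within the same window of `L` consecutive positions (locality), while `rho2` makes them unlikely to share the exact same position (identity).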

Figures (8)

  • Figure 1: An overview of the gene string tokenization process followed by BF insertion and query. The long gene string of a genome is broken into kmers (base substrings of length $k$) using a moving window over the string, and each kmer is then inserted into the BF. At query time, the input sequence is likewise broken into kmers, and the membership of each kmer is tested against the BF. If all the kmers pass the membership test, the query is taken to be present in the corresponding genome.
  • Figure 2: Illustration of the behavior of different hash functions. While RH disregards similarity in input space, LSH causes similar elements to collide. IDL, on the other hand, maintains locality while discouraging collisions.
  • Figure 3: Illustration of gene sequence indexing and search using BF and IDL-BF. [Left] Traditional BFs use the cache inefficiently because each successive kmer is mapped to a random location. [Right] IDL-BF is cache-efficient: it uses the similarity of successive kmers to co-locate their bit signatures and thus uses cache lines effectively.
  • Figure 4: [Left: LSH computation on kmer] Each kmer (k=8 in this case) is split into sub-kmers of length t (t=5), and min-hash is applied to this set of sub-kmers. The probability that two kmers collide is equal to the Jaccard similarity between their two sets. [Right: Rolling min-hash] As two successive kmers differ in only one sub-kmer, we can reuse the hash computations from the previous kmer for the current kmer. We build a complete segment tree data structure on the sub-kmer hashes of the first kmer. For each subsequent kmer, we replace exactly one leaf node, the one corresponding to the sub-kmer that is no longer present in the current kmer, with the new incoming sub-kmer. Thus, for each kmer after the first, we only need to compute one sub-kmer hash and update $\log(k-t)$ min values in the tree.
  • Figure 5: The impact of different sizes of IDL-BF vs BF on query time, indexing time, FPR, and cache miss rates, averaged over 5 runs. The query time and indexing time of IDL-BF grow significantly more slowly than those of BF, while achieving a similar FPR. For the same BF size, IDL-BF achieves up to 41.9% and 44.3% reductions in query time and indexing time, respectively. The reductions in query and indexing time are accounted for by the reductions in cache miss rate achieved by IDL-BF. For the same BF size, IDL-BF achieves up to 76.2% and 77.0% reductions in L1 and L3 cache miss rates during querying, and up to 83.0% and 82.6% reductions in L1 and L3 cache miss rates during indexing, respectively.
  • ...and 3 more figures
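
The rolling min-hash described in the Figure 4 caption can be sketched in Python as follows. This is an illustrative reading of the caption, not the authors' code: the function names, the array-based segment tree padded to a power of two, the circular assignment of sub-kmers to leaves, and the salted BLAKE2b hash are all assumptions.

```python
import hashlib


def sub_hash(s, seed=0):
    """Salted 64-bit hash of one sub-kmer (stand-in for the min-hash base hash)."""
    digest = hashlib.blake2b(
        s.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def rolling_minhash(seq, k, t):
    """Return the min-hash value of every kmer in seq over its length-t sub-kmers.

    A segment tree of minima is built once for the first kmer; each later
    kmer overwrites one leaf and refreshes the minima on its path to the root.
    """
    if len(seq) < k:
        return []
    n = k - t + 1                      # number of sub-kmers per kmer
    size = 1
    while size < n:                    # pad leaf count to a power of two
        size *= 2
    inf = float("inf")
    tree = [inf] * (2 * size)          # tree[1] is the root, leaves start at size

    # Build the tree for the first kmer: sub-kmer starting at j goes to leaf j.
    for j in range(n):
        tree[size + j] = sub_hash(seq[j:j + t])
    for node in range(size - 1, 0, -1):
        tree[node] = min(tree[2 * node], tree[2 * node + 1])
    out = [tree[1]]

    for i in range(1, len(seq) - k + 1):
        # The sub-kmer starting at i-1 leaves the window; the one starting
        # at i+n-1 enters and reuses the same leaf (circular buffer).
        new_start = i + n - 1
        node = size + (i - 1) % n
        tree[node] = sub_hash(seq[new_start:new_start + t])
        node //= 2
        while node >= 1:               # refresh minima up to the root
            tree[node] = min(tree[2 * node], tree[2 * node + 1])
            node //= 2
        out.append(tree[1])
    return out
```

Each step after the first computes a single new sub-kmer hash and touches one root-to-leaf path, which is logarithmic in the number of sub-kmers, in line with the cost stated in the caption.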

Theorems & Definitions (8)

  • Definition 1: LSH Family
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2