
Neighborhood density estimation using space-partitioning based hashing schemes

Aashi Jindal

TL;DR

This work presents FiRE, a linear-time, hashing-based approach for identifying rare cell sub-populations in large-scale single-cell RNA sequencing data, and FiRE.1, an extension using projection hashing to detect both local and global outliers. It further introduces Enhash, a fast streaming ensemble learner for concept drift detection that updates in O(1) per sample using projection hashing and a forgetting mechanism. Across extensive synthetic and real datasets, FiRE.1 consistently achieves superior performance on outlier detection tasks, while Enhash demonstrates robust drift adaptation with low memory and fast runtimes. Collectively, the methods enable scalable anomaly and drift detection in high-dimensional, dynamic data regimes, with practical impact in bioinformatics and streaming analytics.

Abstract

This work introduces FiRE/FiRE.1, a novel sketching-based algorithm for anomaly detection to quickly identify rare cell sub-populations in large-scale single-cell RNA sequencing data. This method demonstrated superior performance against state-of-the-art techniques. Furthermore, the thesis proposes Enhash, a fast and resource-efficient ensemble learner that uses projection hashing to detect concept drift in streaming data, proving highly competitive in time and accuracy across various drift types.

Paper Structure

This paper contains 70 sections, 40 equations, 25 figures, 14 tables, 4 algorithms.

Figures (25)

  • Figure 1: Overall categorization of well-known unsupervised anomaly detection algorithms. The three broad categories are statistical, sub-space based, and nearest-neighbor based.
  • Figure 2: Overview of FiRE. The first step assigns each cell to a hash-code. Because many similar cells can share the same hash-code, a hash-code can be thought of as an imaginary bucket. Hash-code creation is repeated $L$ times to make the rarity estimates reliable. For a given cell $i$ and estimator $l$, $p_{il}$ is the probability that any point falls into the bucket of cell $i$. In the second phase, the algorithm combines these probabilities to estimate how rare each cell is.
  • Figure 3: Stability of FiRE. (a),(c) RMS difference in the FiRE score of every cell between two successive estimators. For the RMS calculation, the FiRE score is averaged across multiple seeds and normalized by the value of $L$. (b),(d) RMS difference in the FiRE score between two successive values of $M$. For the RMS calculation, the FiRE score is averaged across multiple seeds and normalized by the value of $M$. (a)-(b) RMS on a simulated dataset consisting of a mixture of Jurkat and 293T cells Zheng. (c)-(d) RMS on $\sim$68k Peripheral Blood Mononuclear Cells (PBMCs) Zheng.
  • Figure 4: Performance evaluation of FiRE on Peripheral Blood Mononuclear Cells (PBMCs). (a) t-SNE-based 2D embedding of the data with color-coded cluster identities as reported by Zheng and colleagues Zheng. (b) Rare population identified by FiRE using the IQR-thresholding criterion. (c) Heat map of FiRE scores for the individual PBMCs. The cluster of megakaryocytes (0.3%), the rarest of all the cell types, is assigned the highest FiRE scores.
  • Figure 5: In the $\sim$68k PBMC data Zheng, minor cell populations with varying degrees of rarity appear as the number of selected rare cells increases. Panels (a)-(c) show the top 0.25%, 2%, and 5% of cells, respectively, selected by FiRE score.
  • ...and 20 more figures
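
The two phases described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `rarity_scores`, the hash built from thresholded random features, the defaults for `L` and `M`, and the choice to combine the $p_{il}$ by averaging their negative logarithms are all assumptions made for illustration. The caption states only that cells are hashed $L$ times and that the bucket probabilities are combined.

```python
import numpy as np

def rarity_scores(X, L=50, M=8, seed=0):
    """Bucket-based rarity score; an illustrative sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    weights = 2 ** np.arange(M, dtype=np.int64)   # packs M bits into one integer
    scores = np.zeros(n)
    for _ in range(L):                            # L independent estimators
        # Assumption: each hash bit thresholds one randomly chosen feature.
        dims = rng.integers(0, d, size=M)
        thresholds = rng.uniform(lo[dims], hi[dims])
        bits = (X[:, dims] > thresholds).astype(np.int64)
        codes = bits @ weights                    # one hash-code per cell
        # Cells with the same code share a bucket; p_il = bucket size / n.
        _, inverse, counts = np.unique(
            codes, return_inverse=True, return_counts=True
        )
        p = counts[inverse] / n
        # Assumption: combine estimators by averaging -log(p_il).
        scores += -np.log(p)
    return scores / L

# Usage: a dense cluster plus a handful of far-away points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(500, 5)), np.full((5, 5), 10.0)])
scores = rarity_scores(X)
```

Each estimator costs one pass over the cells, so the total work grows linearly with the number of cells for fixed `L` and `M`. Cells that repeatedly land in sparsely populated buckets accumulate larger scores, which is the sense in which bucket occupancy serves as a density estimate.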