Neighborhood density estimation using space-partitioning based hashing schemes
Aashi Jindal
TL;DR
This work presents FiRE, a linear-time, hashing-based approach for identifying rare cell sub-populations in large-scale single-cell RNA sequencing data, and FiRE.1, an extension using projection hashing to detect both local and global outliers. It further introduces Enhash, a fast streaming ensemble learner for concept drift detection that updates in O(1) per sample using projection hashing and a forgetting mechanism. Across extensive synthetic and real datasets, FiRE.1 consistently achieves superior performance on outlier detection tasks, while Enhash demonstrates robust drift adaptation with low memory and fast runtimes. Collectively, the methods enable scalable anomaly and drift detection in high-dimensional, dynamic data regimes, with practical impact in bioinformatics and streaming analytics.
Abstract
This thesis introduces FiRE and its extension FiRE.1, sketching-based anomaly detection algorithms that quickly identify rare cell sub-populations in large-scale single-cell RNA sequencing data. In the reported experiments, these methods outperform state-of-the-art techniques. The thesis also proposes Enhash, a fast and resource-efficient ensemble learner that uses projection hashing to detect concept drift in streaming data and is highly competitive in both time and accuracy across various drift types.
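
To make the idea of hashing-based neighborhood density estimation concrete, the following is a minimal Python sketch, not the FiRE implementation. It assumes a simple threshold-based hash (a random feature subset binarized against random thresholds) and scores each point by the average negative log of its bucket occupancy, so that points falling in sparsely populated buckets receive high rarity scores. The function name `rarity_scores`, its parameters, and the scoring formula are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def rarity_scores(X, n_estimators=50, n_dims=8, seed=0):
    """Illustrative hash-based rarity score (hypothetical helper, not FiRE itself).

    Each estimator hashes every row of X to a bucket; rows that share
    buckets with few other rows get a high score.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_dims = min(n_dims, d, 62)          # keep bit codes within int64
    lo, hi = X.min(axis=0), X.max(axis=0)
    weights = 1 << np.arange(n_dims, dtype=np.int64)
    scores = np.zeros(n)
    for _ in range(n_estimators):
        # Pick a random feature subset and one random threshold per feature.
        dims = rng.choice(d, size=n_dims, replace=False)
        thresholds = rng.uniform(lo[dims], hi[dims])
        # Binarize and pack the bits into a single integer bucket id.
        bits = (X[:, dims] > thresholds).astype(np.int64)
        codes = bits @ weights
        # Bucket occupancy serves as a proxy for local density.
        _, inverse, counts = np.unique(
            codes, return_inverse=True, return_counts=True
        )
        scores += -np.log(counts[inverse] / n)
    return scores / n_estimators

# Usage: a dense cluster plus a handful of distant points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 10)),   # abundant population
    rng.normal(8.0, 0.1, size=(5, 10)),     # rare population
])
scores = rarity_scores(X)
```

Each estimator makes a single pass over the data, so the total cost grows linearly with the number of samples for a fixed number of estimators, which is the property that makes this family of methods attractive for large datasets.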
