HERP: Hardware for Energy Efficient and Realtime DB Search and Cluster Expansion in Proteomics
Md Mizanur Rahaman Nayan, Zheyu Li, Flavio Ponzina, Sumukh Pinge, Tajana Rosing, Azad J. Naeemi
TL;DR
HERP addresses the resource-intensive problem of database search and clustering in mass spectrometry proteomics by combining seed-based incremental clustering with a hardware-software co-design that uses a 3T2MTJ SOT-MRAM CAM for in-memory, bucketed search. Pre-clustered seeds initialize the hardware once, after which incremental clustering and on-chip search proceed without clustering from scratch. This yields approximately $20\times$ faster clustering, with bucket-wise parallelization and query scheduling contributing a further $100\times$ speedup, while maintaining high fidelity to state-of-the-art results (≈96% search overlap and ≈0.3% additional clustering error). Experimental evaluation on datasets ranging from several gigabytes to 131 GB demonstrates favorable latency and energy profiles: setup costs ~1.19 mJ for 2M spectra, and search costs ~1.1 µJ per 1000 queries. By minimizing data movement through compute-in-memory and relying on robust hyperdimensional computing (HDC) representations, HERP enables real-time proteomics analytics near the acquisition hardware, which benefits researchers dealing with evolving protein variants and large spectral libraries.
Abstract
Database search and clustering are fundamental components of many data analytics problems, such as mass spectrometry-driven proteomics. Traditional full clustering and search algorithms suffer from high resource usage and long latencies. We introduce HERP, a lightweight incremental clustering method and a highly parallelizable database (DB) search platform that uses 3T2MTJ SOT-MRAM-based content-addressable memory (CAM) in 7 nm technology for in-memory acceleration. A single hardware initialization using pre-clustered proteomics data allows for continuous DB searching and local re-clustering, providing a more practical and efficient alternative to clustering from scratch. Heuristics derived from the initial pre-clustered data guide the incremental process, accelerating clustering by 20x at the cost of a 0.3% increase in clustering error, while DB search results overlap by 96% with state-of-the-art (SOTA) algorithms, validating search quality. For a 131 GB human genome proteomics dataset, HERP setup requires 1.19 mJ for 2M spectra, while a 1000-query search consumes only 1.1 µJ at SOTA accuracy. Bucket-wise parallelization and query scheduling provide an additional 100x speedup.
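
The seed-based incremental flow described above can be illustrated with a minimal software sketch. It is not the HERP hardware or the authors' algorithm: the class name `SeededBuckets`, the hypervector width `DIM`, the integer bucket keys, and the Hamming-distance threshold are all hypothetical choices made for illustration. The sketch assumes binary hypervectors, a nearest-seed match by Hamming distance (the kind of comparison a CAM performs in memory), and a search restricted to a single bucket per query.

```python
import random
from collections import defaultdict

DIM = 1024  # hypothetical hypervector width in bits (assumption)


def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit vectors stored as Python ints."""
    return bin(a ^ b).count("1")


class SeededBuckets:
    """Illustrative bucketed seed store; a software stand-in, not the CAM."""

    def __init__(self, threshold: int):
        self.threshold = threshold          # max distance to join a cluster
        self.buckets = defaultdict(list)    # bucket key -> [(cluster_id, seed)]
        self.next_id = 0

    def add_seed(self, bucket: int, hv: int) -> int:
        """Store a seed hypervector and return its new cluster id."""
        cid = self.next_id
        self.next_id += 1
        self.buckets[bucket].append((cid, hv))
        return cid

    def search(self, bucket: int, hv: int):
        """Return (cluster_id, distance) of the nearest seed in one bucket."""
        best = None
        for cid, seed in self.buckets.get(bucket, []):
            d = hamming(seed, hv)
            if best is None or d < best[1]:
                best = (cid, d)
        return best

    def assign(self, bucket: int, hv: int) -> int:
        """Join the nearest cluster if close enough, else start a new one."""
        hit = self.search(bucket, hv)
        if hit is not None and hit[1] <= self.threshold:
            return hit[0]
        return self.add_seed(bucket, hv)


def flip_bits(hv: int, n: int, rng: random.Random) -> int:
    """Flip n distinct bit positions to emulate a noisy spectrum encoding."""
    for pos in rng.sample(range(DIM), n):
        hv ^= 1 << pos
    return hv


rng = random.Random(0)
index = SeededBuckets(threshold=DIM // 4)

# One-time initialization from (here, synthetic) pre-clustered seeds.
seed_a = rng.getrandbits(DIM)
seed_b = rng.getrandbits(DIM)
id_a = index.add_seed(bucket=500, hv=seed_a)
id_b = index.add_seed(bucket=500, hv=seed_b)

# Incremental step: a noisy copy joins its cluster, a novel vector opens one.
assigned_near = index.assign(500, flip_bits(seed_a, 20, rng))
assigned_novel = index.assign(500, rng.getrandbits(DIM))
empty_bucket_hit = index.search(999, seed_a)
```

Because each query touches only its own bucket, separate buckets can be searched independently, which is the property that bucket-wise parallelization relies on. In this sketch a new spectrum never triggers a global re-clustering; it either joins an existing seed or becomes a new one locally.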
