Real-time raw signal genomic analysis using fully integrated memristor hardware

Peiyi He, Shengbo Wang, Ruibin Mao, Sebastian Siegel, Giacomo Pedretti, Jim Ignowski, John Paul Strachan, Ruibang Luo, Can Li

TL;DR

A memristor-based hardware–software co-design that processes raw sequencer signals directly in analog memory, merging the normally separate basecalling and read-mapping steps, and achieves substantial gains in speed and energy efficiency, enabling real-time, on-site genomic analysis.

Abstract

Advances in third-generation sequencing have enabled portable, real-time genomic sequencing, but real-time data processing remains a bottleneck, hampering on-site genomic analysis due to prohibitive time and energy costs. These technologies generate massive amounts of noisy analog signals that traditionally require basecalling and digital read mapping, both of which demand frequent and costly data movement on von Neumann hardware. To overcome these challenges, we present a memristor-based hardware-software co-design that processes raw sequencer signals directly in analog memory, effectively combining the separate basecalling and read mapping steps. Here we demonstrate, for the first time, end-to-end memristor-based genomic analysis in a fully integrated memristor chip. By exploiting intrinsic device noise for locality-sensitive hashing and implementing parallel approximate searches in content-addressable memory, we experimentally showcase on-site applications including infectious disease detection and metagenomic classification. Our experimentally validated analysis confirms the effectiveness of this approach on real-world tasks, achieving a state-of-the-art 97.15% F1 score in virus raw signal mapping, with a 51x speedup and 477x energy savings compared to an implementation on a state-of-the-art ASIC. These results demonstrate that memristor-based in-memory computing provides a viable solution for integration with portable sequencers, enabling truly real-time on-site genomic analysis for applications ranging from pathogen surveillance to microbial community profiling.
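The two algorithmic ingredients named in the abstract, locality-sensitive hashing and approximate search in content-addressable memory, can be illustrated in software. The following is a minimal sketch under stated assumptions, not the chip's implementation: the random hyperplanes are drawn from a seeded Gaussian generator (the hardware uses intrinsic memristor noise instead), the approximate match is modeled as a Hamming-distance threshold, and all function names (`make_hyperplanes`, `lsh_bits`, `cam_search`) are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with zero-mean Gaussian entries.

    Assumption: a seeded software RNG stands in for device noise.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_bits(vector, hyperplanes):
    """Emit one bit per hyperplane: 1 if the vector is on its non-negative side."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    ]


def hamming(a, b):
    """Count positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cam_search(query_bits, stored_rows, threshold):
    """Software stand-in for an approximate CAM lookup.

    Returns the indices of stored rows that differ from the query in at
    most `threshold` bits. A real CAM compares all rows in parallel; this
    loop only reproduces the result, not the parallelism.
    """
    return [
        i for i, row in enumerate(stored_rows)
        if hamming(query_bits, row) <= threshold
    ]
```

In this sketch, vectors that point in similar directions tend to fall on the same side of most hyperplanes and therefore receive bit patterns with a small Hamming distance, which is what allows a thresholded lookup to tolerate noise in the input signal.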

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures.

Figures (6)

  • Figure 1: Real-time genomic analysis in a memristor crossbar and memristor CAM. a, Conventional pipeline for nanopore genomic analysis. Nanopore sequencing generates noisy analog raw signals that must first be translated into precise nucleotide base sequences (basecalling). These short sequences are then searched within a large reference genome database to identify their locations (read mapping). Because the memory unit and compute unit are separate in current von Neumann architectures, these two steps require frequent data movement, wasting significant time and energy. b, Our in-memory computing software-hardware co-design for real-time genomic analysis. Our approach eliminates these inefficiencies by directly aligning analog raw signals with the analog reference genome. We hash the analog signals into binary features and perform parallel searches within memory, bypassing the need for separate basecalling and read mapping steps. c, Efficient real-time analysis pipeline. This in-memory computing system delivers immediate genomic information with low latency and high energy efficiency, enabling rapid on-site sequencing and analysis for critical applications such as cancer diagnostics, infectious disease detection, and scientific discovery. TE, top electrode; BE, bottom electrode; ML, match line; SLs, search lines.
  • Figure 2: In-memory raw signal analysis pipeline with hardware-software co-design. a, Reference data preprocessing. Using the k-mer model released by ONT, the reference is converted into expected analog vectors representing the ideal currents when the given reference DNA sequences are in the nanopore. b, Input signal data preprocessing. Each analog event value is the mean of the region between two abrupt-change boundaries detected by a t-test. Each boundary approximately indicates that a new DNA molecule goes through the nanopore. This rough and simple segmentation introduces many errors, which the following steps must tolerate. c, Locality-sensitive hashing (LSH) via random hyperplanes and its hardware implementation. Analog vectors are transformed into binary feature vectors by projecting them onto many random hyperplanes, with their positions relative to each hyperplane determining the binary output. Random memristor crossbar arrays can implement this algorithm efficiently by exploiting the intrinsic noise of memristors. d, The approximate in-memory search engine, a content-addressable memory (CAM), performs a fuzzy seed-and-vote algorithm. The feature vectors extracted from the reference genome are stored sequentially, one per memory row, with two distinct memristor conductance states encoding a binary '0' or '1'. After all input LSH seeds are compared with all stored rows, the matched location should accumulate clearly more votes than other locations.
  • Figure 3: Experimental implementation of memristor-based real-time genomic analysis. a, Photograph of our fully integrated memristor chip with digital and analog peripheral circuits. b, Optical image of one memristor crossbar array. c, Cross-sectional TEM image of one nanoscale memristor device. d, The zero-mean $40 \times 32$ random array generated by subtracting adjacent columns of a $40 \times 64$ conductance matrix. e, The random conductance distribution of our memristor arrays. f, The corresponding distribution of random conductance differences in our memristor arrays, with approximately zero mean. g, The ideal target $64 \times 256$ virus conductance array of the CAM and the experimental $64 \times 256$ virus conductance array of the CAM. h, The detailed conductance distribution of the binary experimental virus CAM array.
  • Figure 4: Experimental results for infectious virus detection. a, Infectious disease detection pipeline. Scientists sequence all the species in the potential virus environment and align all raw signals to a pre-stored virus reference on-site for real-time analysis. Once an emerging infectious virus is detected, strategies should be promptly implemented to prevent economic and human losses. b, The recall, precision and F1 surfaces as functions of the CAM threshold and the votes threshold. The best F1 score, 96.36%, is achieved when recall and precision are well balanced. c, The distribution of votes with a fixed CAM threshold of 16. Reads that match the pre-stored virus reference receive more votes, so reads with more than seven votes are classified as COVID virus.
  • Figure 5: Experimental results for metagenomic classification. a, On-site relative abundance estimation. Raw signals from a metagenomic sample are efficiently aligned to pre-stored genome references, and each raw signal is assigned to its corresponding species. b, The true abundance and the experimentally estimated abundance in the relative abundance estimation experiments. c, The detailed confusion matrix of the relative abundance estimation experiments. d, The normalized confusion matrix of the relative abundance estimation experiments.
  • ...and 1 more figure
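
The fuzzy seed-and-vote step described in the Figure 2d caption, together with the CAM threshold and votes threshold from the Figure 4 caption, can be sketched in software as follows. This is an illustrative sketch only, not the authors' implementation. It assumes that reference feature vectors are stored one per row in genome order, that a seed matches a row when their Hamming distance is at most the CAM threshold, and that seed `i` matching row `j` casts one vote for the read starting at row `j - i`. That voting rule and the names `seed_and_vote` and `call_read` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count positions where two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def seed_and_vote(seed_bits, cam_rows, cam_threshold):
    """Compare every seed with every stored row and tally votes.

    Assumption: seed i matching row j votes for start location j - i,
    so seeds from one read that hit consecutive rows agree on a location.
    """
    votes = Counter()
    for i, seed in enumerate(seed_bits):
        for j, row in enumerate(cam_rows):
            if hamming(seed, row) <= cam_threshold:
                votes[j - i] += 1
    return votes


def call_read(votes, votes_threshold):
    """Return (location, vote_count) for the top location if its vote
    count is strictly larger than votes_threshold, otherwise None."""
    if not votes:
        return None
    location, count = votes.most_common(1)[0]
    return (location, count) if count > votes_threshold else None
```

With the values quoted in the Figure 4c caption, this would correspond to calling `seed_and_vote(..., cam_threshold=16)` followed by `call_read(votes, votes_threshold=7)`, although the feature-vector width and seed construction used on the chip are not specified in the text above.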