
Accelerating Seed Location Filtering in DNA Read Mapping Using a Commercial Compute-in-SRAM Architecture

Courtney Golden, Dan Ilan, Nicholas Cebry, Christopher Batten

TL;DR

The paper speeds up the filtering stage of reference-guided DNA read mapping by offloading Myers' bit-parallel edit-distance calculation to a commercial compute-in-SRAM accelerator (the Gemini APU). It provides a microcode-level architectural and programming model for the Gemini APU and demonstrates how Myers' algorithm can be mapped to this massively parallel, bit-sliced hardware to process thousands of candidate alignments in parallel. The results show substantial end-to-end speedups over a single CPU core (average $14.1\times$, up to $24.1\times$ for some queries) and identify kernel computation and data movement as the dominant runtime components, with speedup scaling as candidate counts grow. The work suggests compute-in-SRAM is well-suited for DNA read filtering and could influence future accelerator designs and genomics pipelines, while outlining opportunities for multicore expansion and longer-read scenarios.

Abstract

DNA sequence alignment is an important workload in computational genomics. Reference-guided DNA assembly involves aligning many read sequences against candidate locations in a long reference genome. To reduce the computational load of this alignment, candidate locations can be pre-filtered using simpler alignment algorithms like edit distance. Prior work has explored accelerating filtering on simulated compute-in-DRAM, due to the massive parallelism of compute-in-memory architectures. In this paper, we present work-in-progress on accelerating filtering using a commercial compute-in-SRAM accelerator. We leverage the recently released Gemini accelerator platform from GSI Technology, which is the first, to our knowledge, commercial-scale compute-in-SRAM system. We accelerate the Myers' bit-parallel edit distance algorithm, producing average speedups of 14.1x over single-core CPU performance. Individual query/candidate alignments produce speedups of up to 24.1x. These early results suggest this novel architecture is well-suited to accelerating the filtering step of sequence-to-sequence DNA alignment.
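
As an illustration of the filtering step described in the abstract, the following is a minimal software sketch of a Myers-style bit-parallel edit distance, written in the commonly used global-distance formulation. It is not the paper's Gemini microcode: it relies on Python's arbitrary-precision integers rather than 16-bit bit-sliced elements, and the names `myers_edit_distance`, `filter_candidates`, and `max_edits` are hypothetical, introduced only for this example.

```python
def myers_edit_distance(pattern: str, text: str) -> int:
    """Global edit distance between pattern and text using bit-vectors.

    Each bit position i tracks row i of one dynamic-programming column,
    so one pass over `text` updates a whole column per character.
    """
    m = len(pattern)
    if m == 0:
        return len(text)

    mask = (1 << m) - 1      # keeps bit-vectors m bits wide
    high = 1 << (m - 1)      # bit for the last pattern row

    # peq[c] has bit i set where pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv = mask                # vertical deltas of +1
    mv = 0                   # vertical deltas of -1
    score = m                # distance against the empty text

    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)   # horizontal deltas of +1
        mh = pv & xh                    # horizontal deltas of -1

        # The last row's horizontal delta updates the running score.
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1

        # Shift with a carry-in of 1: the top boundary row grows by 1.
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv

    return score


def filter_candidates(read: str, candidates: list[str], max_edits: int) -> list[int]:
    """Return indices of candidates within max_edits of the read."""
    return [
        i for i, cand in enumerate(candidates)
        if myers_edit_distance(read, cand) <= max_edits
    ]
```

In this sketch, `filter_candidates("ACGT", ["ACGT", "AGT", "TTTT"], 1)` keeps the first two candidates and rejects the third. The loop over candidates is the part that a massively parallel architecture would evaluate simultaneously, one candidate per lane.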

Paper Structure (19 sections, 1 equation, 6 figures, 5 tables)

Figures (6)

  • Figure 1: APU Architecture -- (a) System Overview, (b) APU Core Logical View, (c) Bank Physical View, (d) Bit Processor Circuitry. CP = control processor, VCU = vector command unit, VXU = vector execution unit, VRF = vector register file, VMRF = vector memory register file, SPM = scratchpad memory, GVL = global vertical latch, R/W = read/write logic, RBL = read bitline, WBL = write bitline, WBLB = write bitline bar, REx = read-enable for bit x, WEx = write-enable for bit x, RLN = north read latch. Note: exact bit-slice organization is not published by GSI.
  • Figure 2: Pseudocode and Microcode for Myers' Algorithm. (a) Pseudocode for alignment of a single read against a single seed from myers-algorithm-1999. (b-g) Microcode fragments for the Myers' algorithm implementation: (b) bitwise OR, used in lines 13-15 and 28 of the pseudocode; (c) vector set-equals, used in a multi-line equivalent to line 12 of the pseudocode on Gemini; (d) sets all elements to a scalar value, used for data initialization on lines 7-9 and 12; (e) saves the last bit (bit 16) that is currently stored in RL, used directly after the functions appearing in lines 23 and 26 of the pseudocode; (f) left bitwise shift with a carry-in, used in lines 22 and 25; (g) ripple-carry elementwise addition for line 14, although the actual implementation of addition used in our algorithm uses a more sophisticated carry-select approach not shown here. vrd = destination vector register, vrs = source vector register, vrd_m = a one-hot encoding for which of the 16 bits of each element of MASK_REG corresponds to the desired mask. rs, in_value, and shift_in are all 16-bit integer operands. b16 is a 16-bit value with only the 16th bit set, and bit is a 16-bit mask for a desired bit.
  • Figure 3: Data Layout of Myers' Algorithm on the APU
  • Figure 4: Cumulative Distribution of Candidates Per Query -- Distribution of the number of candidates produced in each length-range (measured in base pairs) for the 300-base pair reads produced by Mason simulation.
  • Figure 5: Speedup of APU over CPU for Each Query
  • ...and 1 more figure
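
The Figure 2 caption mentions a left shift with a carry-in, saving the top bit of a 16-bit element, and ripple-carry addition. The sketch below shows a word-level software analogue of those primitives, assuming a long bit-vector is split into little-endian 16-bit words. That layout, and the names `shift_left_with_carry` and `ripple_carry_add`, are assumptions made for illustration; they do not reproduce the Gemini microcode or the data layout of Figure 3, and the paper states its actual addition uses a carry-select approach.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def shift_left_with_carry(words, carry_in=0):
    """Shift a little-endian list of 16-bit words left by one bit.

    The bit shifted out of each word (its bit 16) becomes the carry-in
    of the next word. Returns (shifted_words, carry_out).
    """
    out = []
    carry = carry_in & 1
    for word in words:
        out.append(((word << 1) | carry) & WORD_MASK)
        carry = (word >> (WORD_BITS - 1)) & 1   # save the top bit
    return out, carry


def ripple_carry_add(a, b, carry_in=0):
    """Add two equal-length little-endian word lists, rippling the carry."""
    out = []
    carry = carry_in & 1
    for x, y in zip(a, b):
        total = x + y + carry
        out.append(total & WORD_MASK)
        carry = total >> WORD_BITS
    return out, carry
```

Chaining the carry from one 16-bit word to the next is what allows a bit-vector longer than one element (for example, one bit per base of a 300-base-pair read) to be shifted or added as a single logical value.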