SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Sequence Analysis

Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

TL;DR

SAGe tackles the data preparation bottleneck in large-scale genome sequence analysis by introducing an algorithm-architecture co-design that delivers highly compressed genomic data with lightweight, streaming decompression. The approach leverages genomic data properties to encode mismatch information and matching positions in hardware-friendly arrays, paired with a minimal three-unit decompression engine and a tailored storage layout, enabling seamless integration with accelerators and near-data processing. Empirical results show end-to-end speedups up to 32x and energy reductions up to 34x when paired with GEM or GenStore, while maintaining compression ratios comparable to genomics-specific compressors. This work highlights the critical role of data preparation in accelerator efficacy and provides a practical path to unlocking full performance and energy benefits across diverse genome analysis systems.

Abstract

Genome sequence analysis, which examines the DNA sequences of organisms, drives advances in many critical medical and biotechnological fields. Given its importance and the exponentially growing volumes of genomic sequence data, there are extensive efforts to accelerate genome sequence analysis. In this work, we demonstrate a major bottleneck that greatly limits the benefits of state-of-the-art genome sequence analysis accelerators: the data preparation bottleneck, where genomic sequence data is stored in compressed form and must first be decompressed and formatted before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly compressed storage and high-performance access of large-scale genomic sequence data. The key challenge is to improve data preparation performance while maintaining high compression ratios (comparable to genomics-specific compression algorithms) at low hardware cost. We address this challenge by leveraging key properties of genomic datasets to co-design (i) a lossless (de)compression algorithm, (ii) hardware that decompresses data with lightweight operations and efficient streaming accesses, (iii) a storage data layout, and (iv) interface commands to access data. SAGe is highly versatile, as it supports datasets from different sequencing technologies and species. Due to its lightweight design, SAGe can be seamlessly integrated with a broad range of hardware accelerators for genome sequence analysis to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome sequence analysis accelerators by 3.0x-32.1x and 13.0x-34.0x, respectively, compared to when the accelerators rely on state-of-the-art software and hardware decompression tools.
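
To make the idea of encoding matching positions and mismatch information in simple arrays more concrete, the following is a minimal illustrative sketch, not SAGe's actual format or algorithm. It assumes a reference sequence is available, handles substitutions only, and uses hypothetical names (`encode_read`, `decode_read`). Decoding is a single forward pass that copies from the reference and patches mismatches, which is the kind of lightweight, streaming operation the abstract describes.

```python
from typing import List, Tuple


def encode_read(reference: str, read: str, start: int) -> Tuple[int, List[int], List[str]]:
    """Encode a read as (start, mismatch offsets, mismatch bases).

    Hypothetical simplification: substitutions only, no insertions/deletions,
    and the read must lie entirely within the reference.
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    offsets: List[int] = []
    bases: List[str] = []
    for i, (ref_base, read_base) in enumerate(zip(window, read)):
        if ref_base != read_base:
            offsets.append(i)        # where the read differs from the reference
            bases.append(read_base)  # what the read contains at that offset
    return start, offsets, bases


def decode_read(reference: str, start: int, length: int,
                offsets: List[int], bases: List[str]) -> str:
    """Rebuild the read: copy the reference window, then patch mismatches."""
    out = list(reference[start:start + length])
    for off, base in zip(offsets, bases):
        out[off] = base
    return "".join(out)


reference = "ACGTACGTACGT"
read = "GTTCGA"  # aligns at position 2 with two substitutions
start, offsets, bases = encode_read(reference, read, 2)
restored = decode_read(reference, start, len(read), offsets, bases)
```

In this toy example the six-base read is stored as one start position plus two (offset, base) pairs. Keeping positions and bases in separate flat arrays means a decoder can read each array sequentially, without random accesses or complex state.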

Paper Structure

This paper contains 31 sections, 18 figures, 3 tables, 1 algorithm.

Figures (18)

  • Figure 1: Effect of data preparation (i.e., decompressing and formatting genomic sequence data before analysis) on genome analysis performance.
  • Figure 2: Overview of a typical genomic workflow.
  • Figure 3: Overview of genomics-specific compression.
  • Figure 4: End-to-end throughput for different read sets.
  • Figure 5: High-level overview of SAGe's (a) data preparation and (b) data compression and storage.
  • ...and 13 more figures