FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework

Yuanjian Liu; Huihao Luo; Zhijun Han; Yao Hu; Yehui Yang; Kyle Chard; Sheng Di; Ian Foster; Jiesheng Wu

TL;DR

FastqZip is a reference-based FASTQ compressor that reorders reads, uses a refined sequence-matching framework, and supports lossy quality-score compression, followed by final lossless compression with BSC or ZPAQ. Its core innovations include a seed-based index with full seed-position coverage, a combined global and local sequence alignment that handles insertions and deletions via WFA-2, and a segmentation strategy that uses delta encoding and dominant quality bitmaps. Across five real datasets, FastqZip achieves a compression ratio about 10% higher than Genozip, with a controllable slowdown that comes primarily from quality-score processing; lossy quality options can raise the ratio further. The work demonstrates strong scalability and memory efficiency on multi-core hardware, and points to future work on alternative lossless codecs and on GPU/FPGA acceleration to speed up compression.

Abstract

Storing and archiving data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective in dealing with these data. Our work focuses on compressing FASTQ files with an improved reference-based compression algorithm to achieve a higher compression ratio than other state-of-the-art (SOTA) algorithms. We propose FastqZip, which uses a new method for mapping sequences to the reference, allows read reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm for final lossless compression, giving a higher compression ratio at relatively fast speed. Our method ensures that the sequence can be losslessly reconstructed while allowing either lossless or lossy compression of the quality scores. We reorder the reads to obtain a higher compression ratio. We evaluate our algorithms on five datasets and show that FastqZip outperforms the SOTA algorithm Genozip by around 10% in compression ratio with an acceptable slowdown.

Paper Structure (17 sections, 2 equations, 15 figures, 6 tables)

Introduction
Related Work
DNA Sequencing Technologies
DNA Sequence Compression Algorithms
Problem Formulation
Methodology
Compression Architecture
Index Building & Loading
Sequence Alignment
Segmentation
Lossless Compression
Evaluation
Experimental Settings
Compression Performance Evaluation
Resource Consumption Analysis
...and 2 more sections

Figures (15)

Figure 1: Reference-based sequence matching process: (1) use seeds to build an index for the reference sequence; (2) find matching locations for reads on the reference sequence; (3) for unmatched bases, store the difference. Our algorithm improves matching by storing more seeds, which raises the chance of a match, and by using local search to detect insertions and deletions.
Figure 2: FastqZip compression architecture: The read thread must be sequential, but workers can proceed in parallel. The read buffer and write buffer allow maximum parallelism for the whole pipeline.
Figure 3: Index concept: we look for all valid seeds in the reference sequence and record their positions. There are multiple positions because the same seed may appear multiple times in different locations on the reference sequence.
Figure 4: Index storage: The range index and forward index arrays together store the reference positions for all seeds. Each seed maps uniquely to an index $i$ in the range index array. The value index_range[$i$] is the starting index of that seed's positions in the forward index array, and the value index_range[$i+1$] is the index one past their end in the forward index array.
Figure 5: Alignment procedure: when a read contains multiple seeds and a match exists, two seeds should map to the same starting position on the reference. If the candidate sequence on the reference has a very low Hamming distance to the read, it is a match. If seeds agree on a starting position but the Hamming distance is large, we use our proposed local alignment to find a match with an insertion or deletion.
...and 10 more figures
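
The index layout of Figure 4 and the seed-agreement check of Figure 5 can be illustrated with a small Python sketch. It is not the FastqZip implementation: the function names, the 2-bit seed packing, the dense table of size $4^k + 1$, the use of non-overlapping seeds on the read, and the `max_mismatches` threshold are all assumptions made for illustration. The local alignment step for insertions and deletions (WFA-2 in the paper) is omitted; reads that fail the Hamming check simply return `None` here.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def seed_to_int(seed):
    """Pack a seed over {A, C, G, T} into an integer; None for other symbols (assumed encoding)."""
    value = 0
    for base in seed:
        code = BASE_CODE.get(base)
        if code is None:
            return None
        value = value * 4 + code
    return value


def build_index(reference, k):
    """Build (index_range, index_forward) covering every valid k-mer position."""
    buckets = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        key = seed_to_int(reference[pos:pos + k])
        if key is not None:
            buckets[key].append(pos)
    index_range = [0] * (4 ** k + 1)
    index_forward = []
    for key in range(4 ** k):
        # Positions of seed `key` live in index_forward[index_range[key]:index_range[key + 1]].
        index_range[key] = len(index_forward)
        index_forward.extend(buckets.get(key, []))
    index_range[4 ** k] = len(index_forward)
    return index_range, index_forward


def lookup(index_range, index_forward, seed):
    """Return all reference positions at which `seed` occurs."""
    key = seed_to_int(seed)
    if key is None:
        return []
    return index_forward[index_range[key]:index_range[key + 1]]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(reference, index_range, index_forward, read, k, max_mismatches):
    """Return (start, distance) of the best ungapped match, or None."""
    votes = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds (assumption)
        for pos in lookup(index_range, index_forward, read[offset:offset + k]):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
    best = None
    for start, count in votes.items():
        if count < 2:  # require two seeds to agree on the starting position
            continue
        dist = hamming(read, reference[start:start + len(read)])
        if dist <= max_mismatches and (best is None or dist < best[1]):
            best = (start, dist)
    return best


# Toy usage: the read is reference[8:20] with one substituted base.
reference = "ACGTACGTTTGACCAGGATC"
index_range, index_forward = build_index(reference, 4)
hit = align_read(reference, index_range, index_forward, "TTGACTAGGATC", 4, 2)
```

A dense range array makes each seed lookup two array reads and a slice, at the cost of $4^k + 1$ entries; for realistic seed lengths a real implementation would need a more compact mapping from seed to index, which this sketch does not attempt.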
