
An experimental sorting method for improving metagenomic data encoding

Diogo Pratas, Armando J. Pinho

TL;DR

This work provides a compression-based method that uses metagenomic classification and recursive filtering by similarity to reorder the reads and increase the overall compression of metagenomic FASTQ files.

Abstract

Minimizing data storage poses a significant challenge in large-scale metagenomic projects. In this paper, we present a new method for improving the encoding of FASTQ files generated by metagenomic sequencing. The method combines metagenomic classification with a recursive filter that clusters reads by DNA sequence similarity, improving the overall reference-free compression. Our results show an overall improvement in the compression of several datasets. As hypothesized, the compression gain grows progressively with the coverage depth and with the number of identified species. Additionally, we provide an implementation that is freely available at https://github.com/cobilab/mizar and can be customized to work with other FASTQ compression tools.
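
The reordering idea in the abstract can be illustrated with a minimal Python sketch. It is not the mizar implementation: it covers only the grouping step, assuming a classifier has already assigned a label to each read, and it leaves out the recursive similarity filter entirely. The names `parse_fastq`, `sort_by_label`, and the `labels` mapping are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
Record = Tuple[str, ...]


def parse_fastq(text: str) -> List[Record]:
    """Split plain FASTQ text into 4-line records (assumes no wrapped lines)."""
    if not text.strip():
        return []
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def sort_by_label(records: List[Record],
                  labels: Dict[str, str],
                  unclassified: str = "~unclassified") -> List[Record]:
    """Place reads that share a label next to each other.

    `labels` maps a read identifier to a label (hypothetical classifier
    output). Reads without a label go to a trailing group. The original
    order is preserved inside each group.
    """
    groups: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        read_id = rec[0][1:].split()[0]  # drop '@', keep the first token
        groups[labels.get(read_id, unclassified)].append(rec)
    ordered: List[Record] = []
    for label in sorted(groups):
        ordered.extend(groups[label])
    return ordered


def to_fastq(records: List[Record]) -> str:
    """Serialize records back to FASTQ text."""
    return "".join("\n".join(rec) + "\n" for rec in records)
```

Grouping reads with the same label makes similar sequences adjacent, which is the property a reference-free compressor can exploit. The reordered file contains the same records as the original, only in a different order.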

Paper Structure (10 sections, 2 equations, 4 figures)

Figures (4)

  • Figure 1: Architecture of the proposed methodology.
  • Figure 2: Compression gain in Megabytes (original compressed number of bytes minus the sorted compressed number of bytes) for the Fqzcomp, JARVIS, and LZMA compressors while varying the number of reference sequences used in the sequencing simulation.
  • Figure 3: Compression gain in Megabytes (original compressed number of bytes minus the sorted compressed number of bytes) for the Fqzcomp compressor while varying the number of reference sequences used in the sequencing simulation.
  • Figure 4: Compression gain in Megabytes (original compressed number of bytes minus the sorted compressed number of bytes) for the Fqzcomp compressor while varying the sequencing coverage depth used in the sequencing simulation.
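
The compression gain used in the figure captions (original compressed number of bytes minus the sorted compressed number of bytes) can be sketched in Python as follows. This is a minimal illustration that uses the standard-library `lzma` module as a stand-in compressor. The function names and the `preset=9` setting are assumptions, not the paper's configuration.

```python
import lzma


def compressed_size(data: bytes) -> int:
    """Number of bytes after LZMA (xz container) compression."""
    # preset=9 is an illustrative choice, not a setting taken from the paper.
    return len(lzma.compress(data, preset=9))


def compression_gain(original: bytes, reordered: bytes) -> int:
    """Gain in bytes: compressed original minus compressed reordered data."""
    return compressed_size(original) - compressed_size(reordered)
```

A positive value means the reordered file compresses to fewer bytes than the original. A negative value means the reordering made compression worse.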