Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling

Riselda Kodra, Hadjer Benmeziane, Irem Boybat, William Andrew Simon

TL;DR

The paper addresses a bottleneck in metagenomic profiling by integrating multi-class species classification directly into the basecalling stage of nanopore sequencing. It introduces a multi-objective DNN that adds a classifier head to a Bonito basecaller, explores parallel and serial architectures, and trains with a loss strategy that backpropagates the basecalling and classification losses separately. The approach delivers state-of-the-art basecalling performance and high per-read multi-class accuracy, reporting $92.5\%$ top-1 and $98.89\%$ top-3 on the 17-genome Wick bacterial dataset; the top-$K$ ranking lets users trade speed against accuracy. This integration can substantially accelerate downstream metagenomic profiling by reducing the number of genomes each read must be aligned against, while remaining scalable to larger genome collections in future work.

Abstract

The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy in which the losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy that reports top-K species accuracy, giving users the flexibility to choose between higher accuracy and faster species identification. We achieve state-of-the-art basecalling accuracies, while classification accuracies match or exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.
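To make the top-K metric concrete, the following is a minimal sketch of how top-K per-read accuracy could be computed from per-species scores. It is an illustration under stated assumptions, not the authors' implementation: the function names `top_k_labels` and `top_k_accuracy`, the score values, and the five-species toy setup are all hypothetical.

```python
from typing import List, Sequence


def top_k_labels(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring classes, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def top_k_accuracy(
    batch_scores: Sequence[Sequence[float]],
    true_labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of reads whose true species is among the k top-ranked classes."""
    hits = sum(
        1
        for scores, label in zip(batch_scores, true_labels)
        if label in top_k_labels(scores, k)
    )
    return hits / len(true_labels)


# Toy data (hypothetical, not from the paper): 4 reads scored against 5 species.
batch_scores = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # true species 0 is ranked first
    [0.30, 0.40, 0.20, 0.05, 0.05],  # true species 0 is ranked second
    [0.10, 0.20, 0.15, 0.50, 0.05],  # true species 2 is ranked third
    [0.25, 0.30, 0.20, 0.15, 0.10],  # true species 4 is ranked last
]
true_labels = [0, 0, 2, 4]

print(top_k_accuracy(batch_scores, true_labels, 1))  # 0.25
print(top_k_accuracy(batch_scores, true_labels, 3))  # 0.75
```

A larger K keeps more candidate genomes per read, so the true species is retained more often at the cost of aligning against more genomes downstream; this is the speed-accuracy trade-off described above.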

Paper Structure

This paper contains 21 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Proposed genome sequencing pipeline with species classification while basecalling.
  • Figure 2: Proposed parallel (a) and serial (b) models for the task of classification while basecalling.
  • Figure 3: Validation accuracies for basecalling and top-1 classification for the parallel and serial model architectures.
  • Figure 4: Post-alignment identity accuracy of this work vs. Bonito and SotA classifier RUBICALL [Singh2024-es].
  • Figure 5: Top-K classification accuracy evolution during training of parallel model architecture.
  • ...and 2 more figures