Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling
Riselda Kodra, Hadjer Benmeziane, Irem Boybat, William Andrew Simon
TL;DR
The paper tackles the bottleneck in metagenomic profiling by integrating multi-class species classification directly into the basecalling stage of nanopore sequencing. It introduces a multi-objective DNN that adds a classifier head to a Bonito basecaller, explored in parallel and serial architectures, and trained with a loss strategy that backpropagates the basecalling and classification losses separately. The approach delivers state-of-the-art basecalling performance while achieving high multi-class per-read accuracy, reporting $92.5\%$ top-1 and $98.89\%$ top-3 on the 17-genome Wick bacterial dataset, with top-$K$ results enabling flexible speed-accuracy trade-offs. This integration can substantially accelerate downstream metagenomic profiling by reducing the number of genomes to align against; scaling to larger genome collections is left to future work.
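To make the top-K figures concrete, the following minimal Python sketch shows how per-read top-K accuracy could be computed from classifier scores. The function names, the toy scores, and the four-genome setup are hypothetical and are not taken from the paper; the paper's own ranking strategy may differ.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def top_k_accuracy(
    batch_scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of reads whose true genome is among the k best-scored genomes."""
    hits = sum(
        1
        for scores, label in zip(batch_scores, labels)
        if label in top_k_indices(scores, k)
    )
    return hits / len(labels)


# Toy example (assumed data): 3 reads scored against 4 hypothetical genomes.
batch_scores = [
    [0.10, 0.70, 0.15, 0.05],  # true genome 1 is ranked first
    [0.40, 0.35, 0.20, 0.05],  # true genome 1 is ranked second
    [0.05, 0.15, 0.30, 0.50],  # true genome 0 is ranked last
]
labels = [1, 1, 0]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(batch_scores, labels, k):.3f}")
```

A larger K raises the chance that the true genome is among the candidates, at the cost of more genomes to align against downstream, which is the speed-accuracy trade-off described above.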
Abstract
The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method that profiles the signals coming from sequencing devices in parallel with basecalling, the process of determining their nucleotide sequences, using a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy in which the losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy that reports top-K species accuracy, giving users the flexibility to choose between higher accuracy and higher speed when identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies match or exceed those of state-of-the-art binary classifiers, attaining an average accuracy of 92.5% for the top-1 species and 98.9% for the top-3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling, as it accelerates the bottleneck step of matching the DNA sequence to the correct genome.
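The idea of back-propagating the two losses separately and then combining the shared-layer weights can be sketched on a toy scalar model. This is only an illustration under stated assumptions: the squared-error losses stand in for the real basecalling and classification losses, the combination rule (averaging the two proposed shared weights) is an assumption, and all names are hypothetical. The abstract does not specify how the weights are combined.

```python
from typing import Tuple


def separate_update(
    w_shared: float,
    w_base: float,
    w_cls: float,
    x: float,
    target_base: float,
    target_cls: float,
    lr: float = 0.1,
) -> Tuple[float, float, float]:
    """One training step on a toy model with a shared layer and two heads.

    Shared layer: h = w_shared * x
    Basecalling head (stand-in):    y_base = w_base * h, loss = (y_base - target_base)^2
    Classification head (stand-in): y_cls  = w_cls  * h, loss = (y_cls  - target_cls)^2
    """
    h = w_shared * x

    # Backward pass for the basecalling loss only.
    err_base = w_base * h - target_base
    grad_w_base = 2.0 * err_base * h
    grad_shared_from_base = 2.0 * err_base * w_base * x

    # Backward pass for the classification loss only.
    err_cls = w_cls * h - target_cls
    grad_w_cls = 2.0 * err_cls * h
    grad_shared_from_cls = 2.0 * err_cls * w_cls * x

    # Each objective proposes its own updated copy of the shared weight.
    shared_after_base = w_shared - lr * grad_shared_from_base
    shared_after_cls = w_shared - lr * grad_shared_from_cls

    # ASSUMPTION: the two copies are combined by simple averaging.
    new_w_shared = 0.5 * (shared_after_base + shared_after_cls)

    # Head weights are only touched by their own loss.
    new_w_base = w_base - lr * grad_w_base
    new_w_cls = w_cls - lr * grad_w_cls
    return new_w_shared, new_w_base, new_w_cls


if __name__ == "__main__":
    print(separate_update(1.0, 1.0, 1.0, x=1.0, target_base=2.0, target_cls=0.0))
```

In this toy setting each head is updated only by its own objective, while the shared weight receives a blend of both proposals, so neither loss can dominate the shared layers simply by having a larger magnitude in a summed objective.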
