Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses

Amine M. Remita, Abdoulaye Baniré Diallo

TL;DR

The paper tackles alignment-free viral sequence classification under challenges such as recombination, mutation, and fragmented sequencing data. It provides an exhaustive benchmark of linear models—generative (multinomial Bayes and Markov chains) and discriminative (logistic regression and linear SVM)—using k-mer count representations, evaluated on complete and partial Hepatitis C virus genomes with varying $k$-mer lengths and regularization. Key findings show that Bayesian smoothing substantially improves generative models at larger $k$, while discriminative models achieve near-perfect genotyping and strong subtyping on complete genomes; fragment evaluations reveal sensitivity to fragment size and regularization, with longer fragments and moderate $k$-mer lengths boosting robustness. The work delivers a reproducible framework and benchmark data for robust alignment-free virus genome classification and sets the stage for extending to other viruses beyond Hepatitis C.

Abstract

Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in environmental samples. These methods face several challenges associated with the nature and properties of viral genomes, such as recombination, mutation rate and diversity. In addition, new generations of sequencing technologies raise other difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, there is a lack of exploration of the accuracy space of existing models in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation, such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given precise combinations of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.
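
To make the k-mer count representation and the role of the smoothing value concrete, here is a minimal sketch of a multinomial Bayes classifier over overlapping k-mer counts. It is not the authors' implementation: the function names, the additive (Laplace-style) smoothing scheme, the default `alpha=1.0`, and the toy sequences and labels are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
from math import log


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping words with non-ACGT symbols."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def train_multinomial_bayes(sequences, labels, k, alpha=1.0):
    """Fit per-class k-mer log-probabilities with additive smoothing.

    alpha is an assumed smoothing value; it must be > 0 so that
    k-mers unseen in a class still get a non-zero probability.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    class_sizes = Counter(labels)
    class_counts = {}
    for seq, label in zip(sequences, labels):
        class_counts.setdefault(label, Counter()).update(kmer_counts(seq, k))

    model = {}
    for label, counts in class_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = log(class_sizes[label] / len(labels))
        log_lik = {w: log((counts[w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, seq, k):
    """Return the class with the highest posterior log-score."""
    counts = kmer_counts(seq, k)

    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(c * log_lik[w] for w, c in counts.items())

    return max(model, key=score)


# Hypothetical toy data, not taken from the benchmark datasets.
toy_seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTGGCCTTGGCC", "TTGGCCTAGGCC"]
toy_labels = ["1", "1", "2", "2"]
toy_model = train_multinomial_bayes(toy_seqs, toy_labels, k=2)
```

The vocabulary has $4^k$ words, so as $k$ grows most k-mers are never observed in a given class. Without smoothing their estimated probability is zero, which is one plausible reason the smoothing value matters more at larger $k$.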

Paper Structure

This paper contains 17 sections, 11 equations, 3 figures, 7 tables, 2 algorithms.

Figures (3)

  • Figure 1: Averaged weighted F-measures of generative models tested on complete genomes. Filled regions correspond to the mean $\pm$ standard deviation of weighted F-measures of cross-validation iterations.
  • Figure 2: Averaged weighted F-measures of generative and discriminative models tested on different fragment lengths at subtyping (HCVSUBCG dataset). Filled regions correspond to the mean $\pm$ standard deviation of weighted F-measures of cross-validation iterations.
  • Figure A.1: Averaged weighted F-measures of generative and discriminative models tested on different fragment lengths at genotyping (HCVGENCG dataset). Filled regions correspond to the mean $\pm$ standard deviation of weighted F-measures of cross-validation iterations.
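
The metric named in these captions can be sketched as follows, assuming the usual definition of the weighted F-measure: the per-class F1 score averaged with weights proportional to each class's number of true instances. The function names, the use of the population standard deviation, and the toy labels are assumptions for illustration; the paper may compute the spread across cross-validation iterations differently.

```python
from collections import Counter
from statistics import mean, pstdev


def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += (n_true / total) * f1
    return score


def summarize_folds(fold_scores):
    """Mean and (assumed population) standard deviation across folds."""
    return mean(fold_scores), pstdev(fold_scores)


# Hypothetical predictions for a single cross-validation iteration.
y_true = ["1a", "1a", "1b", "1b"]
y_pred = ["1a", "1b", "1b", "1b"]
fold_score = weighted_f_measure(y_true, y_pred)
```

Weighting by class support keeps frequent classes from being under-represented in the average, which matters when class sizes are unbalanced.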