Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning

Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu

TL;DR

The paper classifies VLP stoichiometry directly from protein sequences using an interpretable ML pipeline. It curates a balanced dataset of 60-mer and 180-mer homomeric VLPs, compares encoding schemes (one-hot vs integer-label) and maps (CHARPROTSET vs knowledge-based clusters) within linear classifiers, and analyses feature weights to extract biological insights. Key contributions include publicly releasing the dataset, demonstrating that one-hot encoding with linear models improves classification performance, and identifying residue-level features with structural interpretations (e.g., in MS2) that may influence assembly. The approach offers a faster, data-driven route to guide VLP design for vaccines, with interpretable model weights and directions for future expansion to more diverse data and multimodal inputs.

Abstract

Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits that form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To efficiently classify the stoichiometry of VLP proteins, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
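
The two feature encodings compared in the pipeline can be illustrated with a minimal sketch. It is not taken from the released code: the 20-letter alphabet stands in for the CHARPROTSET map (which may contain more symbols), the cluster grouping is a hypothetical example of a knowledge-based map, and the function names and the `max_len` padding scheme are assumptions made for illustration.

```python
# Illustrative alphabet: the 20 standard amino acids. This is an assumption;
# the CHARPROTSET map used in the paper may contain additional symbols.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CHAR_MAP = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding/unknown

# Hypothetical cluster map grouping residues by broad physicochemical class.
# The paper's knowledge-based clusters may be defined differently.
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
CLUSTER_MAP = {aa: i + 1 for i, group in enumerate(CLUSTERS) for aa in group}


def integer_encode(seq, mapping, max_len):
    """Map each residue to an integer label, truncating or zero-padding to max_len."""
    codes = [mapping.get(aa, 0) for aa in seq[:max_len]]
    return codes + [0] * (max_len - len(codes))


def one_hot_encode(seq, mapping, max_len):
    """Flatten a (max_len x n_symbols) one-hot matrix into a single feature vector."""
    n_symbols = max(mapping.values())
    features = []
    for code in integer_encode(seq, mapping, max_len):
        row = [0] * n_symbols
        if code > 0:  # padding and unknown residues stay all-zero
            row[code - 1] = 1
        features.extend(row)
    return features


if __name__ == "__main__":
    print(integer_encode("ACD", CHAR_MAP, 5))
    print(len(one_hot_encode("ACD", CHAR_MAP, 5)))
```

With one-hot encoding, a linear model learns one weight per (position, symbol) pair, which is what makes residue-level heatmaps of the weights possible. Integer-label encoding yields only one weight per position and imposes an arbitrary ordering on the symbols.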

Paper Structure

This paper contains 19 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the development of an interpretable data-driven pipeline for VLP stoichiometry classification. Protein sequences from the Protein Data Bank (PDB) are curated into a balanced dataset of 60-mers and 180-mers, and subsequently processed through the classification pipeline: feature encoding with selected maps and methods, model training using linear machine learning models, and evaluation based on performance metrics, model interpretation and biological analysis.
  • Figure 2: Heatmaps of model weights from RC with one-hot encoding at the amino acid level. The first 10 positions are shown in subfigure (a) using the CHARPROTSET map and in subfigure (b) using the knowledge-based clusters map. The colour scale indicates contributions to stoichiometry classification: red signifies a positive contribution to classifying a protein sequence as a 180-mer, and blue signifies a positive contribution to classifying it as a 60-mer.
  • Figure 3: Position-level model weights of the first 600 amino acids in protein sequences from RC with one-hot encoding on the CHARPROTSET. Each point indicates the average positional weight, with error bars representing the standard deviation.
  • Figure 4: Comparison of four positional influence methods (a-d) on AUROC as the percentage of selected positions varies from 1% to 40%. The highest AUROC for each method is starred and annotated with its percentage and score. Each point shows the mean AUROC, with error bars showing the standard deviation.
  • Figure 5: Visualisation of positional weights from RC with one-hot encoding on the CHARPROTSET mapped to the MS2 structure (PDB: 2MS2). In subfigure (a), the complete structure, comprising 180 subunits, is colour-coded according to positional influence scores, using a gradient from grey (low influence) to red (high influence). Subfigure (b) highlights one subunit in orange to illustrate its interactions with neighbouring protein subunits.
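
The position-level analysis behind Figures 3 to 5 can be sketched as follows, assuming the weight vector of a linear model trained on one-hot features is laid out position by position. The four positional influence methods compared in Figure 4 are not specified in these captions, so the score used here (mean absolute weight per position) is only one plausible choice, and all function names are hypothetical.

```python
def reshape_weights(flat_weights, n_symbols):
    """Split a flat one-hot weight vector into one row of weights per position."""
    if len(flat_weights) % n_symbols != 0:
        raise ValueError("weight vector length must be a multiple of n_symbols")
    return [flat_weights[i:i + n_symbols]
            for i in range(0, len(flat_weights), n_symbols)]


def positional_influence(weight_rows):
    """Score each position by its mean absolute weight (an assumed scoring rule)."""
    return [sum(abs(w) for w in row) / len(row) for row in weight_rows]


def top_positions(scores, percent):
    """Return the indices of the top `percent` of positions, keeping at least one."""
    k = max(1, round(len(scores) * percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy weights for 4 positions and 2 symbols per position.
    flat = [1.0, -1.0, 4.0, -6.0, 0.0, 0.0, -3.0, 1.0]
    scores = positional_influence(reshape_weights(flat, n_symbols=2))
    print(scores)
    print(top_positions(scores, percent=50))
```

The selected positions could then be used to retrain a classifier on a reduced feature set, or mapped onto residue indices of a known structure for visual inspection.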