Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning
Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu
TL;DR
The paper classifies VLP stoichiometry directly from protein sequences using an interpretable ML pipeline. It curates a balanced dataset of 60-mer and 180-mer homomeric VLPs, compares encoding schemes (one-hot vs integer-label) and amino-acid maps (CHARPROTSET vs knowledge-based clusters) within linear classifiers, and analyzes the learned feature weights to extract biological insights. Key contributions include publicly releasing the dataset, showing that one-hot encoding with linear models gives better classification performance than integer-label encoding, and identifying residue-level features with structural interpretations (e.g., for MS2) that may influence assembly. The approach offers a faster, data-driven route to guide VLP design for vaccines, keeps the model interpretable, and can be extended in future work to more diverse data and multimodal inputs.
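To make the difference between the two encoding schemes concrete, the following is a minimal sketch rather than the authors' implementation. The 20-letter alphabet, the use of index 0 for padding or unknown residues, and the function names `integer_label_encode` and `one_hot_encode` are assumptions made for illustration; the actual CHARPROTSET and knowledge-based cluster maps are defined in the released code.

```python
# Assumed alphabet: the 20 standard amino acids (not the paper's exact map).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding and unknown residues (an assumption).
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def integer_label_encode(seq, max_len):
    """Map each residue to one integer, then pad with zeros to max_len."""
    codes = [AA_TO_INDEX.get(aa, 0) for aa in seq[:max_len]]
    return codes + [0] * (max_len - len(codes))


def one_hot_encode(seq, max_len):
    """Give every (position, residue) pair its own binary feature."""
    n = len(AMINO_ACIDS)
    vec = [0] * (max_len * n)
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_TO_INDEX.get(aa)
        if idx is not None:  # unknown residues stay all-zero
            vec[pos * n + (idx - 1)] = 1
    return vec


labels = integer_label_encode("ACD", max_len=5)
one_hot = one_hot_encode("ACD", max_len=5)
print(labels)        # one integer per position
print(len(one_hot))  # max_len * 20 binary features
```

With integer labels, a linear model receives a single number per position, which imposes an arbitrary ordering on the residues. With one-hot encoding, each position and residue pair has its own weight, so a learned weight can be read as evidence for one residue at one position.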
Abstract
Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits that form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To efficiently classify proteins by stoichiometry class, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that may influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
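The idea of reading sequence features off a linear model can be sketched as follows. This is an illustrative example under stated assumptions, not the pipeline from the repository: logistic regression trained by plain gradient descent stands in for "linear machine learning models", the four three-residue sequences and their labels are made-up toy data (not taken from the curated dataset), and the helper names `one_hot`, `train_logistic_regression`, and `top_features` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def one_hot(seq):
    """One binary feature per (position, residue) pair."""
    n = len(AMINO_ACIDS)
    vec = [0.0] * (len(seq) * n)
    for pos, aa in enumerate(seq):
        vec[pos * n + AMINO_ACIDS.index(aa)] = 1.0
    return vec


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Batch gradient descent on the logistic loss (no bias term)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - target) * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def top_features(w, k):
    """Return the k largest-magnitude weights as (position, residue, weight)."""
    n = len(AMINO_ACIDS)
    ranked = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]
    return [(j // n, AMINO_ACIDS[j % n], w[j]) for j in ranked]


# Toy data, invented for illustration: label 1 and label 0 stand in for
# two stoichiometry classes and differ only at position 1 (K vs A).
toy_sequences = ["MKG", "LKG", "MAG", "LAG"]
toy_labels = [1, 1, 0, 0]

X = [one_hot(s) for s in toy_sequences]
weights = train_logistic_regression(X, toy_labels)
top = top_features(weights, k=2)
for pos, aa, weight in top:
    print(f"position {pos}, residue {aa}, weight {weight:+.3f}")
```

Because every weight corresponds to exactly one position and residue, ranking weights by magnitude points directly at the sequence positions that separate the two classes. In this toy case those are the two residues at position 1, with a positive weight for K and a negative weight for A.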
