Predicting Gene Disease Associations in Type 2 Diabetes Using Machine Learning on Single-Cell RNA-Seq Data
Maria De La Luz Lomboy Toledo, Daniel Onah
TL;DR
This study identifies gene expression signatures associated with type 2 diabetes (T2D) at single-cell resolution by applying two supervised machine learning methods, Partial Least Squares Discriminant Analysis (PLS-DA) and Extra Trees Classifier (ETC), to the Mouse Islet Atlas. It processes mouse beta-cell scRNA-seq data from the db/db and mSTZ models and achieves near-perfect discrimination between diabetic and healthy cells, while providing interpretable gene rankings linked to insulin secretion and stress pathways. By demonstrating both high predictive accuracy and biologically meaningful feature importance, the work offers a framework for biomarker discovery and mechanistic understanding of T2D that can inform precision medicine and translational research. The complementary use of an interpretable model (PLS-DA) and a scalable model (ETC) strengthens confidence in the molecular signatures shared across diabetic states.
Abstract
Diabetes is a chronic metabolic disorder characterized by elevated blood glucose levels due to impaired insulin production or function. Two main forms are recognized: type 1 diabetes (T1D), which involves autoimmune destruction of insulin-producing β-cells, and type 2 diabetes (T2D), which arises from insulin resistance and progressive β-cell dysfunction. Understanding the molecular mechanisms underlying these diseases is essential for developing improved therapeutic strategies, particularly those targeting β-cell dysfunction. Mouse models have played a central role in investigating these mechanisms in a controlled and biologically interpretable setting. Owing to their genetic and physiological similarity to humans, together with the ability to manipulate their genome precisely, mice enable detailed investigation of disease progression and gene function. In particular, mouse models have provided critical insights into β-cell development, cellular heterogeneity, and functional failure under diabetic conditions. Building on these experimental advances, this study applies machine learning methods to single-cell transcriptomic data from mouse pancreatic islets. Specifically, we evaluate two supervised approaches identified in the literature, Extra Trees Classifier (ETC) and Partial Least Squares Discriminant Analysis (PLS-DA), to assess their ability to identify T2D-associated gene expression signatures at single-cell resolution. Model performance is evaluated using standard classification metrics, with an emphasis on interpretability and biological relevance.
