Predicting Gene Disease Associations in Type 2 Diabetes Using Machine Learning on Single-Cell RNA-Seq Data

Maria De La Luz Lomboy Toledo, Daniel Onah

TL;DR

This study identifies gene expression signatures associated with type 2 diabetes (T2D) at single-cell resolution, using the Mouse Islet Atlas and two supervised machine learning methods: Partial Least Squares Discriminant Analysis (PLS-DA) and the Extra Trees Classifier (ETC). Trained on mouse beta-cell scRNA-seq data from the db/db and mSTZ models, the classifiers achieve near-perfect discrimination between diabetic and healthy cells while providing interpretable gene rankings linked to insulin secretion and stress pathways. By demonstrating both high predictive accuracy and biologically meaningful feature importance, the work offers a framework for biomarker discovery and mechanistic understanding of T2D that can inform precision medicine and translational research. Using an interpretable model (PLS-DA) alongside a scalable one (ETC) strengthens confidence in the molecular signatures shared across diabetic states.

Abstract

Diabetes is a chronic metabolic disorder characterized by elevated blood glucose levels due to impaired insulin production or function. Two main forms are recognized: type 1 diabetes (T1D), which involves autoimmune destruction of insulin-producing β-cells, and type 2 diabetes (T2D), which arises from insulin resistance and progressive β-cell dysfunction. Understanding the molecular mechanisms underlying these diseases is essential for the development of improved therapeutic strategies, particularly those targeting β-cell dysfunction. To investigate these mechanisms in a controlled and biologically interpretable setting, mouse models have played a central role in diabetes research. Owing to their genetic and physiological similarity to humans, together with the ability to precisely manipulate their genome, mice enable detailed investigation of disease progression and gene function. In particular, mouse models have provided critical insights into β-cell development, cellular heterogeneity, and functional failure under diabetic conditions. Building on these experimental advances, this study applies machine learning methods to single-cell transcriptomic data from mouse pancreatic islets. Specifically, we evaluate two supervised approaches identified in the literature, Extra Trees Classifier (ETC) and Partial Least Squares Discriminant Analysis (PLS-DA), to assess their ability to identify T2D-associated gene expression signatures at single-cell resolution. Model performance is evaluated using standard classification metrics, with an emphasis on interpretability and biological relevance.

Paper Structure

This paper contains 12 sections, 2 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: UMAP projection of annotated cell types in the Mouse Islet Atlas. Each colour denotes a specific cell type, including endocrine populations (e.g., $\alpha$, $\beta$, $\delta$), exocrine cells (acinar and ductal), and non-pancreatic cell types such as immune, endothelial, and stellate cells.
  • Figure 2: ROC curve for the PLS-DA model evaluated on the test set. The area under the curve (AUC) is 0.999, indicating near-perfect discrimination between diabetic and healthy cells.
  • Figure 3: PLS-DA scores for test samples projected onto the first two latent components. Blue points correspond to healthy cells, while orange points correspond to T2D cells. The clear separation between groups indicates strong transcriptomic differences between conditions.
  • Figure 4: Confusion matrix summarizing the PLS-DA model predictions on the test set. Rows correspond to true labels, and columns correspond to predicted labels. Correct classifications lie along the diagonal.
  • Figure 5: Cross-validated ROC (ETC). The curve lies near the top-left corner; the mean AUC is approximately 0.999.
  • ...and 2 more figures
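
The captions above refer to two standard classification metrics, the area under the ROC curve and the confusion matrix. The sketch below shows how each is computed, in plain Python with no external libraries. The labels and scores are toy values chosen for illustration, not results from the paper, and the 0.5 decision threshold is an assumption.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 count matrix: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 0 = healthy, 1 = diabetic.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

cm = confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(cm)             # [[2, 1], [1, 2]]
print(round(auc, 3))  # 0.778
```

Correct classifications fall on the diagonal of `cm`, matching the row and column convention described for Figure 4. An AUC near 1.0, as reported for both models, means almost every diabetic cell receives a higher score than almost every healthy cell.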