Classifying Dry Eye Disease Patients from Healthy Controls Using Machine Learning and Metabolomics Data

Sajad Amouei Sheshkal, Morten Gundersen, Michael Alexander Riegler, Øygunn Aass Utheim, Kjell Gunnar Gundersen, Hugo Lewi Hammer

TL;DR

This study addresses the classification of cataract patients by dry eye disease (DED) status using high-dimensional metabolomics data acquired in two electrospray ionization modes (ESI+ and ESI-). Comparing nine machine learning models under nested cross-validation, the authors show that a ridge-regularized logistic regression (LR) model trained on the merged ESI+ and ESI- dataset achieves the highest AUROC (0.8378) and remains competitive on the other reported metrics (balanced accuracy 0.735, Matthews correlation coefficient 0.5147, F1-score 0.8513, specificity 0.5667), with XGBoost and Random Forest close behind. Merging data from both ionization modes improves model training and generalization, while the LR model offers interpretable insights into metabolite signatures associated with DED. The findings support metabolomics-guided, ML-based screening for DED in pre-surgical cataract patients and lay groundwork for future explainable AI analyses to pinpoint clinically relevant biomarkers.

Abstract

Dry eye disease is a common disorder of the ocular surface that leads patients to seek eye care. Clinical signs and symptoms are currently used to diagnose dry eye disease. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored using machine learning and metabolomics information to identify which cataract patients suffered from dry eye disease. As there is no one-size-fits-all machine learning model for metabolomics data, choosing the most suitable model can significantly affect the quality of predictions and subsequent metabolomics analyses. To address this challenge, we conducted a comparative analysis of nine machine learning models on three metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess the performance of these models, we selected a set of evaluation metrics suited to the data set's challenges. The logistic regression model performed best overall, achieving the highest area under the curve (AUC) score of 0.8378, a balanced accuracy of 0.735, a Matthews correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. The XGBoost and Random Forest models followed logistic regression and also demonstrated good performance.
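The threshold-based metrics named in the abstract (balanced accuracy, Matthews correlation coefficient, F1-score, specificity) can all be derived from a single confusion matrix. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names and the toy labels are hypothetical and chosen only to illustrate how a high F1-score can coexist with a modest specificity when positives outnumber negatives.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = DED, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return the confusion-matrix metrics listed in the abstract."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC is defined as 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }


# Toy, imbalanced labels (illustrative only, not data from the study).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the F1-score is noticeably higher than the balanced accuracy and the Matthews correlation coefficient, because F1 ignores true negatives. This is one reason to report several metrics side by side on an imbalanced data set rather than relying on one.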

Paper Structure

This paper contains 14 sections, 6 equations, 2 figures, 5 tables, 1 algorithm.

Figures (2)

  • Figure 1: Operational flow of the proposed dry eye disease classification.
  • Figure 2: Visualizing model performance: The graph provides a comparison of AUC scores derived from 10-fold cross-validation, highlighting the balance between mean AUC and AUC SD. The lighter and darker shades indicate base and optimized models, respectively.