Classifying Dry Eye Disease Patients from Healthy Controls Using Machine Learning and Metabolomics Data
Sajad Amouei Sheshkal, Morten Gundersen, Michael Alexander Riegler, Øygunn Aass Utheim, Kjell Gunnar Gundersen, Hugo Lewi Hammer
TL;DR
This study addresses the problem of classifying cataract patients by dry eye disease (DED) status using high-dimensional metabolomics data acquired in two ionization modes, ESI+ and ESI-. Comparing nine machine learning models with nested cross-validation, the authors show that a logistic ridge regression model trained on the merged ESI+ and ESI- dataset achieves the highest AUROC (0.8378) and competitive scores on the other evaluation metrics, with XGBoost and Random Forest close behind. Merging the data from both modes improves model training and generalization, and the logistic regression model offers interpretable insight into the metabolite signatures associated with DED. The findings advance metabolomics-guided, machine-learning-based screening for DED in pre-surgical cataract patients and lay the groundwork for future explainable AI analyses to pinpoint clinically relevant biomarkers.
Abstract
Dry eye disease is a common disorder of the ocular surface, leading patients to seek eye care. Clinical signs and symptoms are currently used to diagnose dry eye disease. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored using machine learning and metabolomics information to identify which cataract patients suffered from dry eye disease. As there is no one-size-fits-all machine learning model for metabolomics data, the choice of model can significantly affect the quality of predictions and of subsequent metabolomics analyses. To address this challenge, we conducted a comparative analysis of nine machine learning models on three metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess their performance, we selected a set of evaluation metrics suited to the challenges of the data sets. The logistic regression model performed best overall, achieving the highest area under the curve (0.8378), a balanced accuracy of 0.735, a Matthews correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. The XGBoost and Random Forest models followed logistic regression and also performed well.
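The threshold-based metrics named in the abstract (balanced accuracy, Matthews correlation coefficient, F1-score, and specificity) can all be derived from a single binary confusion matrix. The following is a minimal sketch of their standard definitions, not the authors' evaluation code. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion counts.

    Hypothetical helper for illustration; the positive class is assumed
    to be "dry eye disease" and the negative class "healthy control".
    """
    # Sensitivity (recall): share of actual positives predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: share of predicted positives that are actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # Balanced accuracy averages the two per-class recall values.
    balanced_accuracy = (sensitivity + specificity) / 2

    # F1 is the harmonic mean of precision and recall; it ignores true negatives.
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0

    # Matthews correlation coefficient uses all four cells of the matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Illustrative counts only (assumed, not from the study):
# 50 positives and 25 negatives, i.e. an imbalanced test set.
metrics = binary_metrics(tp=45, fp=10, tn=15, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the F1-score does not use true negatives, it can remain high on an imbalanced data set even when specificity is modest. Balanced accuracy and the Matthews correlation coefficient are sensitive to errors on the minority class, which is one reason to report them alongside F1.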
