First Experiences with the Identification of People at Risk for Diabetes in Argentina using Machine Learning Techniques

Enzo Rucci, Gonzalo Tittarelli, Franco Ronchetti, Jorge F. Elgart, Laura Lanzarini, Juan José Gagliardino

TL;DR

The paper tackles identifying individuals at risk for Type 2 Diabetes and Prediabetes in Argentina using machine learning on the PPDBA database, addressing the challenge of population-specific risk profiles. It preprocesses data, creates three segmentation schemes balancing features and sample size, and evaluates five classifiers, notably Random Forest, Decision Tree, and Artificial Neural Networks, across these segmentations. The strongest results appear on CLD-bin and CGD-bin, with accuracies above 90% and AUCs around 0.93–0.95, while excluding lab features (CD-bin) markedly reduces performance. This work demonstrates the feasibility of population-specific, data-driven screening tools in Argentina and lays groundwork for expanding data collection and exploring multiclass or clinical-only approaches in the future.
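The two headline metrics quoted above, accuracy and AUC, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration. The AUC is computed here through its pairwise interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data (not from the paper): 1 = at risk, 0 = not at risk.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.3f}, AUC={auc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are usually reported together.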

Abstract

Detecting Type 2 Diabetes (T2D) and Prediabetes (PD) is a real challenge for medicine due to the absence of pathogenic symptoms and the lack of known associated risk factors. Although some machine learning models have been proposed to identify people at risk, the nature of the condition means that a model suitable for one population may not be suitable for another. This article discusses the development and assessment of predictive models to identify people at risk for T2D and PD specifically in Argentina. First, the database was thoroughly preprocessed, and three specific datasets were generated as a compromise between the number of records and the number of available variables. Five classification models were then applied, and some of them performed very well on two of the datasets. In particular, Random Forest (RF), Decision Tree (DT), and Artificial Neural Network (ANN) models demonstrated great classification power, with good values for the metrics under consideration. Given the lack of this type of tool in Argentina, this work represents a first step towards the development of more sophisticated models.

Paper Structure (25 sections, 1 figure, 4 tables)