Machine Learning and Statistical Insights into Hospital Stay Durations: The Italian EHR Case

Marina Andric, Mauro Dragoni

TL;DR

The study investigates factors shaping length of hospital stay (LoS) using Italian EHR data from 66 facilities in Piedmont (2020–2023) and compares predictive models. It combines statistical analyses with machine learning (CatBoost and Random Forest), using varied feature representations that include patient demographics, comorbidities, embeddings for diagnoses and procedures, and historical LoS. Significant associations emerge with age, comorbidity burden, admission type, and admission month, with CatBoost achieving a maximum $R^2$ of $0.49$ on validation/test when all features are included. Historical LoS stands out as the strongest predictor, while hospital-volume effects change over time, underscoring the value of robust LoS forecasting for hospital resource planning. The findings point to the utility of advanced encoding schemes and to future work on task-specific embeddings that could further improve predictive performance.

Abstract

Length of hospital stay (LoS) is a critical metric for assessing healthcare quality and optimizing hospital resource management. This study aims to identify factors influencing LoS within the Italian healthcare context, using a dataset of hospitalization records from over 60 healthcare facilities in the Piedmont region, spanning from 2020 to 2023. We explored a variety of features, including patient characteristics, comorbidities, admission details, and hospital-specific factors. Significant correlations were found between LoS and features such as age group, comorbidity score, admission type, and the month of admission. Machine learning models, specifically CatBoost and Random Forest, were used to predict LoS. The highest $R^2$ score, $0.49$, was achieved with CatBoost, demonstrating good predictive performance.
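
For readers less familiar with the metric, the $R^2$ score is the coefficient of determination, commonly defined as $R^2 = 1 - SS_{res}/SS_{tot}$. The following is a minimal Python sketch under that standard definition. The function name `r_squared` and the toy LoS values are illustrative assumptions and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical lengths of stay in days (not data from the paper).
observed_los = [2, 3, 5, 8, 12]
predicted_los = [3, 3, 6, 7, 9]
score = r_squared(observed_los, predicted_los)
```

Under this definition, a model that always predicts the mean LoS scores 0 and a perfect model scores 1, so a value of 0.49 means the model accounts for roughly half of the variance in LoS.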

Paper Structure

This paper contains 16 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Distribution of length of stay in days overall (left) and for specific years (right). In the violin plots, observations above the 98th percentile have been excluded for clarity, while black horizontal bars denote the 25th, 50th (median), and 75th percentiles.
  • Figure 2: SHAP feature importance bar plot (left) and residuals distribution histogram (right) for the CatBoost model trained on 2021 data with all features. Outliers below the 2nd and above the 98th percentile were removed from the histogram for clarity.
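
The percentile-based trimming described in the Figure 2 caption could be sketched as follows. This assumes Python's standard `statistics.quantiles` with its default interpolation; the source does not state how the percentiles were computed, and both `trim_residuals` and the sample residuals are hypothetical.

```python
import statistics


def trim_residuals(residuals, lower_pct=2, upper_pct=98):
    """Keep only residuals between the given lower and upper percentiles."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals to estimate percentiles")
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99,
    # so percentile p sits at index p - 1.
    cuts = statistics.quantiles(residuals, n=100)
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [r for r in residuals if lo <= r <= hi]


# Hypothetical residuals (predicted minus observed LoS, in days).
sample_residuals = list(range(-50, 51))
trimmed = trim_residuals(sample_residuals)
```

Trimming of this kind affects only what the histogram displays; it does not change the residuals used to evaluate the model.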