Impact of Comprehensive Data Preprocessing on Predictive Modelling of COVID-19 Mortality

Sangita Das, Subhrajyoti Maji

TL;DR

The paper addresses COVID-19 mortality prediction by focusing on data quality and preprocessing. It presents a custom preprocessing pipeline comprising weekly pattern imputation, local outlier processing, computation-based feature derivation, and iterative feature selection, and contrasts it with a standard preprocessing baseline. The main contribution is empirical evidence that the custom pipeline substantially improves predictive performance: the MLPRegressor achieves a test RMSE of $66.556$ and $R^2=0.991$, whereas the best model under the standard pipeline, the DecisionTree Regressor, reaches a test RMSE of $222.858$ and $R^2=0.817$. The custom pipeline also yields a much lower RMSE variance, indicating stronger generalization. The results demonstrate the value of tailored preprocessing for time-series COVID-19 data and suggest that the same techniques can be adapted to other datasets and contexts.

Abstract

Accurate predictive models are crucial for analysing COVID-19 mortality trends. This study evaluates the impact of a custom data preprocessing pipeline on ten machine learning models predicting COVID-19 mortality using data from Our World in Data (OWID). Our pipeline differs from a standard preprocessing pipeline through four key steps. Firstly, it transforms weekly reported totals into daily updates, correcting reporting biases and providing more accurate estimates. Secondly, it uses localised outlier detection and processing to preserve data variance and enhance accuracy. Thirdly, it utilises computational dependencies among columns to ensure data consistency. Finally, it incorporates an iterative feature selection process to optimise the feature set and improve model performance. Results show a significant improvement with the custom pipeline: the MLP Regressor achieved a test RMSE of 66.556 and a test R-squared of 0.991, surpassing the DecisionTree Regressor from the standard pipeline, which had a test RMSE of 222.858 and a test R-squared of 0.817. These findings highlight the importance of tailored preprocessing techniques in enhancing predictive modelling accuracy for COVID-19 mortality. Although specific to this study, these methodologies offer valuable insights into diverse datasets and domains, improving predictive performance across various contexts.
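
To make the first step concrete, the following is a minimal sketch of turning weekly reported totals into daily values. It is not the authors' weekly pattern imputation: it assumes, purely for illustration, that a weekly report shows up as one nonzero value preceded by six zero days, and it spreads that total evenly over the seven days. The function name `spread_weekly_totals` and the `period` parameter are hypothetical.

```python
import pandas as pd


def spread_weekly_totals(series: pd.Series, period: int = 7) -> pd.Series:
    """Spread each isolated weekly total evenly over the `period` days ending on it.

    Assumption (not from the paper): a weekly report is a nonzero value whose
    preceding `period - 1` days are all zero or missing.
    """
    values = series.fillna(0.0).to_numpy(dtype=float)
    out = values.copy()
    for i in range(period - 1, len(values)):
        previous_days = values[i - period + 1 : i]  # the period-1 days before day i
        if values[i] > 0 and not previous_days.any():
            # Replace the spike and its zero-filled run with an even daily share,
            # which preserves the reported total.
            out[i - period + 1 : i + 1] = values[i] / period
    return pd.Series(out, index=series.index, name=series.name)
```

An even split keeps the weekly sum unchanged but ignores any within-week pattern; a pattern-based method such as the one the paper names would presumably weight the days differently.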

Paper Structure

This paper contains 55 sections, 10 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: COVID Data Preprocessing Pipelines for (a) Standard and (b) Custom approaches
  • Figure 2: Comparison of Original vs. Custom Processed $new\_deaths$ Data (Zoomed-In View of Samples from index 400 to 600): Highlighting the Effectiveness of Weekly Pattern Imputation.
  • Figure 3: Comparison of outlier detection and winsorization techniques for the $new\_vaccinations$ column from index 400 to 1000. (a) Global outlier detection and processing with the standard pipeline, and (b) Local outlier detection and processing with the custom pipeline.
  • Figure 4: Column dependency graph for the 'death' columns with processing orders
  • Figure 5: Plot of the computation-based processing of the $positive\_rate$ column
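
The contrast drawn in Figure 3 between global and local outlier processing can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: outliers are handled by clipping (winsorizing) at quantile bounds, computed over the whole series in the global case and over a centred rolling window in the local case. The function names, the window size, and the 5%/95% quantiles are hypothetical choices.

```python
import pandas as pd


def winsorize_global(s: pd.Series, lower: float = 0.05, upper: float = 0.95) -> pd.Series:
    """Clip values to quantile bounds computed once over the whole series."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))


def winsorize_local(
    s: pd.Series, window: int = 11, lower: float = 0.05, upper: float = 0.95
) -> pd.Series:
    """Clip each value to quantile bounds computed over a centred rolling window."""
    roll = s.rolling(window=window, center=True, min_periods=1)
    # The bounds are Series aligned with `s`, so every point gets its own limits.
    return s.clip(lower=roll.quantile(lower), upper=roll.quantile(upper))
```

On a series whose level shifts over time, as pandemic counts do, global bounds can miss a spike that is extreme only relative to its neighbours, while local bounds pull it in and leave the later high-level regime untouched.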