Denoising ESG: quantifying data uncertainty from missing data with Machine Learning and prediction intervals

Sergio Caprioli, Jacopo Foschi, Riccardo Crupi, Alessandro Sabatino

TL;DR

The paper addresses missing data in ESG datasets, which can distort ESG scores and hinder comparability. It benchmarks several imputation methods (KNN, MICE with RF, DAE, and GCN) on a real-world ISP dataset and embeds uncertainty quantification via prediction intervals. A five-step multiple imputation (MI) workflow using MICE with RF, PMM, and LRD generates multiple imputations and synthetic datasets from which prediction intervals are derived; these intervals achieve coverage of roughly $89\%$–$93\%$ across ESG score levels. The work demonstrates that probabilistic imputation, particularly MI via MICE, enables risk-aware ESG scoring by propagating missing-data uncertainty into the final ratings, with GCN offering similar accuracy at higher cost.
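The idea of turning multiple imputations into a prediction interval for a score can be sketched in a few lines. This is an illustrative toy only, not the paper's pipeline: it swaps MICE with RF for a simple random hot-deck draw from observed donor values, and the names `multiply_impute`, `score`, `prediction_interval`, and `coverage`, the plain-average score, and all numbers are hypothetical.

```python
import math
import random
import statistics


def multiply_impute(row, donors, n_imputations, rng):
    """Return n_imputations completed copies of row (None marks a missing KPI).

    Assumption: each missing KPI is filled by a random draw from the observed
    values of the same column (hot-deck), standing in for MICE with RF.
    """
    return [
        [rng.choice(donors[j]) if value is None else value
         for j, value in enumerate(row)]
        for _ in range(n_imputations)
    ]


def score(kpis):
    # Hypothetical score: plain average of the KPI values.
    return statistics.fmean(kpis)


def prediction_interval(samples, alpha=0.10):
    """Empirical percentile interval covering roughly 1 - alpha of samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    lo_idx = math.floor((alpha / 2) * last)
    hi_idx = math.ceil((1 - alpha / 2) * last)
    return ordered[lo_idx], ordered[hi_idx]


def coverage(intervals, true_values):
    """Fraction of true values that fall inside their interval."""
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, true_values))
    return hits / len(true_values)


rng = random.Random(0)
# Observed values per KPI column (made-up numbers).
donors = [[0.2, 0.4, 0.6], [0.1, 0.5, 0.9], [0.3, 0.3, 0.7]]
row = [0.5, None, None]  # one counterparty with two missing KPIs

scores = [score(c) for c in multiply_impute(row, donors, 200, rng)]
low, high = prediction_interval(scores)

# A fully observed counterparty has no imputation uncertainty.
full_scores = [score(c) for c in multiply_impute([0.5, 0.5, 0.5], donors, 50, rng)]
full_low, full_high = prediction_interval(full_scores)
```

In this sketch the interval width grows with the number of missing KPIs and collapses to zero for a fully observed row. Computing `coverage` on held-out values whose true scores are known is how a coverage figure like the one quoted above could be checked.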

Abstract

Environmental, Social, and Governance (ESG) datasets are frequently plagued by significant data gaps, leading to inconsistencies in ESG ratings due to varying imputation methods. This paper explores the application of established machine learning techniques for imputing missing data in a real-world ESG dataset, emphasizing the quantification of uncertainty through prediction intervals. By employing multiple imputation strategies, this study assesses the robustness of imputation methods and quantifies the uncertainty associated with missing data. The findings highlight the importance of probabilistic machine learning models in providing a better understanding of ESG scores, thereby addressing the inherent risk of wrong ratings due to incomplete data. This approach improves imputation practices to enhance the reliability of ESG ratings.

Paper Structure (7 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Test Root Mean Squared Error (RMSE) of imputation methods. Boxplots account for RMSE variability among KPIs. Mean values are reported as red triangles.
  • Figure 2: Comparison of marginal distributions of observed values and imputed values from 5 imputations of KPIs belonging to KPI "Carbon Footprint".
  • Figure 3: Jointplot comparing the distributions of Pillar Scores from multiple imputations by MICE with Pillar Scores from single imputation by MICE (a, b and c). Results from 5 example counterparties are reported in different colors. The missing rate of each counterparty is reported in subplot d.
  • Figure 4: Boxplot of the width of prediction intervals of ESG Scores of all counterparties, by Tier and missing rate (i.e., proportion of missing KPIs per counterparty).