ExioML: Eco-economic dataset for Machine Learning in Global Sectoral Sustainability
Yanming Guo, Charles Guan, Jin Ma
TL;DR
ExioML addresses the lack of open, ML-ready benchmarks for environmentally extended multi-regional input-output (EE-MRIO) analysis. It is a high-resolution benchmark built on ExioBase 3.8.2 with two data modalities, PxP (200 products) and IxI (163 industries), covering 49 regions from 1995 to 2022. It supports graph-based and tabular ML, provides GPU-accelerated footprint calculations and an open toolkit for flexible factor selection, and demonstrates usability through a sectoral GHG emissions regression task on which models reach low MSE. Deep models (e.g., GANDALF) generally outperform shallow baselines, while RF and GBDT remain competitive at lower compute cost, which establishes a baseline for future ML research on EE-MRIO data. By reducing data access barriers and providing reproducible, scalable MRIO contexts, ExioML aims to support climate action insights and sustainable investment decisions through interdisciplinary ML applications.
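To make the footprint calculation mentioned above concrete, here is a minimal sketch of the standard demand-driven Leontief model that EE-MRIO analysis generally rests on. It is not the ExioML toolkit's API: the function name `consumption_footprint`, the variable names, and the two-sector numbers are all hypothetical, and NumPy on the CPU stands in for any GPU backend.

```python
import numpy as np


def consumption_footprint(A, f, y):
    """Emissions per sector needed to satisfy final demand y.

    A: (n, n) technical coefficients, A[i, j] = input from sector i
       per unit of output of sector j
    f: (n,) direct emission intensity per unit of sectoral output
    y: (n,) final demand by sector
    """
    n = A.shape[0]
    # Total output x satisfies x = A x + y, i.e. (I - A) x = y.
    # Solving the linear system avoids forming the inverse explicitly.
    x = np.linalg.solve(np.eye(n) - A, y)
    # Scale each sector's required output by its emission intensity.
    return f * x


# Toy two-sector economy (illustrative numbers, not ExioML data):
# each unit of sector 1 output needs 0.5 units of sector 0 output.
A = np.array([[0.0, 0.5],
              [0.0, 0.0]])
f = np.array([2.0, 1.0])   # emissions per unit of output
y = np.array([0.0, 10.0])  # final demand only for sector 1

emissions = consumption_footprint(A, f, y)
total = emissions.sum()
```

In this toy case, final demand of 10 units from sector 1 pulls 5 units of output from sector 0, so half of the total footprint is upstream emissions that direct accounting for sector 1 would miss.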
Abstract
Environmentally Extended Multi-Regional Input-Output (EE-MRIO) analysis is the predominant framework in Ecological Economics for assessing the environmental impact of economic activities. This paper introduces ExioML, the first Machine Learning benchmark dataset designed for sustainability analysis, aimed at lowering barriers and fostering collaboration between Machine Learning and Ecological Economics research. To evaluate sectoral sustainability and demonstrate the usability of the dataset, we conduct a greenhouse gas emission regression task. We compare traditional shallow models with deep learning models, using a diverse Factor Accounting table that combines categorical and numerical features. Our findings show that ExioML enables deep and ensemble models to achieve low mean squared error, establishing a baseline for future Machine Learning research. Through ExioML, we aim to build a foundational dataset that supports a range of Machine Learning applications and promotes climate action and sustainable investment decisions.
