A Framework for Scalable Ambient Air Pollution Concentration Estimation
Liam J Berrisford, Lucy S Neal, Helen J Buttery, Benjamin R Evans, Ronaldo Menezes
TL;DR
This work presents a scalable data-driven framework for ambient air pollution concentration estimation that fills temporal and spatial gaps in the UK monitoring network to produce hourly 1 km$^2$ predictions across England for 2018. Using LightGBM on 152 features spanning seven data families, the model forecasts concentrations, estimates concentrations at unseen locations, and predicts peak values; it is trained on 2014–2016 data, validated on 2017 data, and tested on 2018 data. It yields 355,827 synthetic stations and supports two open data products: an augmented AURN dataset and a high-resolution England pollution map, enabling high-fidelity exposure assessments and policy analysis. The approach offers a fast, parallelizable surrogate capable of covering large areas without extensive infrastructure, while acknowledging data gaps and suggesting avenues for improvement with upcoming sensing technologies and cross-border applications.
Abstract
Ambient air pollution remains a critical issue in the United Kingdom, where data on air pollution concentrations form the foundation for interventions aimed at improving air quality. However, the current air pollution monitoring station network in the UK is characterized by spatial sparsity, heterogeneous placement, and frequent temporal data gaps, often due to issues such as power outages. We introduce a scalable, data-driven supervised machine learning framework that addresses these temporal and spatial gaps by filling in missing measurements. This approach provides a comprehensive dataset for England throughout 2018 at an hourly, 1 km × 1 km resolution. Leveraging machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area, yielding data valued at approximately £70 billion. Validation was conducted to assess the model's performance in forecasting, estimating concentrations at missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to a diverse range of stakeholders engaged in downstream assessments supported by outdoor air pollution concentration data for NO2, O3, PM10, PM2.5, and SO2. This resource enables stakeholders to conduct studies at a higher resolution than was previously possible.
