A Framework for Scalable Ambient Air Pollution Concentration Estimation

Liam J Berrisford, Lucy S Neal, Helen J Buttery, Benjamin R Evans, Ronaldo Menezes

TL;DR

This work presents a scalable data-driven framework for ambient air pollution concentration estimation that fills temporal and spatial gaps in the UK monitoring network to produce hourly 1km$^2$ predictions across England for 2018. Using LightGBM on 152 features spanning seven data families, the model forecasts concentrations, estimates data at unseen locations, and predicts peak values, validated on 2017 data and tested on 2018 data with a 2014–2016 training window. It yields 355,827 synthetic stations and supports two open data products: an augmented AURN dataset and a high-resolution England pollution map, enabling high-fidelity exposure assessments and policy analysis. The approach offers a fast, parallelizable surrogate capable of covering large areas without extensive infrastructure, while acknowledging data gaps and suggesting avenues for improvement with upcoming sensing technologies and cross-border applications.

Abstract

Ambient air pollution remains a critical issue in the United Kingdom, where data on air pollution concentrations form the foundation for interventions aimed at improving air quality. However, the current UK air pollution monitoring station network is spatially sparse and heterogeneously placed, and it suffers frequent temporal data gaps, often caused by issues such as power outages. We introduce a scalable, data-driven supervised machine learning framework that addresses these temporal and spatial gaps by filling in the missing measurements. The approach provides a comprehensive dataset for England throughout 2018 at an hourly, 1 km × 1 km resolution. Using machine learning techniques and real-world data from the sparsely distributed monitoring stations, we generate 355,827 synthetic monitoring stations across the study area, yielding data valued at approximately £70 billion. Validation assessed the model's performance in forecasting, estimating concentrations at missing locations, and capturing peak concentrations. The resulting dataset is of particular interest to the diverse range of stakeholders engaged in downstream assessments that rely on outdoor air pollution concentration data for NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$, and SO$_2$. It allows them to conduct studies at a higher resolution than was previously possible.
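
The temporal split named in the TL;DR (training on 2014–2016, validation on 2017, testing on 2018) can be illustrated with a minimal sketch. The record layout, the `split_by_year` helper, and the toy values below are assumptions made for illustration; they are not taken from the authors' code, and the model-fitting step (LightGBM in the paper) is deliberately left out.

```python
from datetime import datetime

# Year ranges follow the summary above; everything else here is hypothetical.
TRAIN_YEARS = {2014, 2015, 2016}
VALIDATION_YEARS = {2017}
TEST_YEARS = {2018}


def split_by_year(records):
    """Partition hourly records into train/validation/test by timestamp year."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        year = record["timestamp"].year
        if year in TRAIN_YEARS:
            splits["train"].append(record)
        elif year in VALIDATION_YEARS:
            splits["validation"].append(record)
        elif year in TEST_YEARS:
            splits["test"].append(record)
        # Records outside 2014-2018 are dropped in this sketch.
    return splits


# Toy data: one made-up NO2 reading per year from 2013 to 2019.
records = [
    {"timestamp": datetime(year, 1, 1, 0), "station": "example", "no2": float(i)}
    for i, year in enumerate(range(2013, 2020))
]
splits = split_by_year(records)
print({name: len(rows) for name, rows in splits.items()})
```

Splitting by calendar year rather than at random keeps every validation and test hour strictly later than the training data, which is what a forecasting assessment requires.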

Paper Structure (25 sections, 1 equation, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Leominster AURN monitoring station NO$_{2}$ measurements. Figure \ref{main:fig:LeominsterAURNStationPeakDay} shows that the peak NO$_{2}$ reading at the Leominster station far exceeds the 24-hour limit, and exceeds the annual limit by an even wider margin, illustrating that periods of quite extreme pollution can occur relative to the annual limits. Figure \ref{main:fig:LeominsterAURNStationPeakYear} shows that there can be extended periods in which air pollution levels are below, or exceed, the designated limits, and relates the station's peak to all data available for the station in 2014.
  • Figure 2: Example feature vector dataset from each dataset family. From left to right, the example datasets are the majority land use classification for each grid (geographic family, discussed in Section \ref{S-sec:landUse}), Sentinel 5P NO$_2$ measurements (remote sensing family, discussed in Section \ref{S-sec:remoteSensingData}), 100m U component of wind (meteorological family, discussed in Section \ref{S-sec:DataDetails:metrological}), NAEI SNAP sector 7 (road transport) NO$_x$ emissions (emissions family, discussed in Section \ref{S-sec:Datadetails:emissions}), road infrastructure distance from the nearest motorway and total length of residential road per grid (transport infrastructure structural properties family, discussed in Section \ref{S-sec:Datadetails:TransportDataInfratsurtcureStructural}), and the car and taxis score (transport use family, discussed in Section \ref{S-sec:transportInfrastructureUseData}).
  • Figure 3: Spearman correlation coefficients overall mean for all pollutants. The mean Spearman correlation coefficients for NO$_x$ and O$_3$ across all environmental classifications of the AURN network are shown for the ten most extreme feature vectors, both positive and negative. The sources and sinks of the two air pollutants differ, in line with the scientific literature (Section \ref{main:sec:featureVectors}): NO$_x$ is highly positively correlated with emission features, whereas O$_3$ exhibits such a relationship mainly with meteorological features, such as wind gusts. For negative correlations the two air pollutants show the opposite pattern, with NO$_x$ negatively correlated with the meteorological features. The analysis highlights how widely the strength of the relationship between a particular phenomenon and a given air pollutant can vary.
  • Figure 4: Spearman correlation coefficients for NO$_x$ monitoring station environmental subclassification locations, Rural Background and Urban Traffic. While Figure \ref{main:fig:spearmanCorrelationsFeatureVectorAirPollutants} highlights the difference between phenomena and air pollutants, a further difference exists between environmental subclassifications. For the Urban Traffic monitoring stations, the primary positive correlations relate to road transport, as would be expected (the strong relationship with Solvent Use is likely an artefact of the scaling discussed in Sections \ref{main:sec:featureVectors} and \ref{S-sec:Datadetails:emissions}, alongside a limited sample size of 41 stations). In contrast, the Rural Background monitoring stations show a strong relationship with emissions from the residential sector, highlighting that the sources and sinks for an air pollutant depend on both the air pollutant itself and the location of interest.
  • Figure 5: Spearman correlation heatmap between all feature vectors. The grey lines throughout the heatmap mark data points missing from the dataset: phenomena with no monitoring stations across all pollutants, including four geographic features and nine emissions features.
  • ...and 6 more figures
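
The feature rankings in Figures 3 to 5 rest on Spearman correlation, which is the Pearson correlation of the rank-transformed values. A minimal sketch using only the Python standard library follows. The helper names, the feature names `road_emissions` and `wind_gust`, and all numbers are made up for illustration and do not come from the paper's data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical pollutant readings and two hypothetical feature vectors.
nox = [12.0, 30.0, 18.0, 55.0, 41.0]
features = {
    "road_emissions": [1.0, 4.0, 2.0, 9.0, 7.0],
    "wind_gust": [9.0, 5.0, 8.0, 1.0, 3.0],
}
scores = {name: spearman(values, nox) for name, values in features.items()}
# Sort from most positive to most negative correlation, as in a ranked bar chart.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because the coefficient depends only on ranks, it captures monotonic relationships without assuming they are linear, which suits features measured on very different scales.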