A Data-Driven Supervised Machine Learning Approach to Estimating Global Ambient Air Pollution Concentrations With Associated Prediction Intervals

Liam J Berrisford, Hugo Barbosa, Ronaldo Menezes

TL;DR

We develop a scalable, data-driven, supervised machine learning framework that imputes missing temporal and spatial measurements of air pollutants, generating a comprehensive dataset covering NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$, and SO$_2$.

Abstract

Global ambient air pollution, a transboundary challenge, is typically addressed through interventions relying on data from spatially sparse and heterogeneously placed monitoring stations. These stations often encounter temporal data gaps due to issues such as power outages. In response, we have developed a scalable, data-driven, supervised machine learning framework. This model is designed to impute missing temporal and spatial measurements, thereby generating a comprehensive dataset for pollutants including NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$, and SO$_2$. The dataset, with a fine granularity of 0.25$^{\circ}$ at hourly intervals and accompanied by prediction intervals for each estimate, caters to a wide range of stakeholders relying on outdoor air pollution data for downstream assessments. This enables more detailed studies. Additionally, the model's performance across various geographical locations is examined, providing insights and recommendations for strategic placement of future monitoring stations to further enhance the model's accuracy.

Paper Structure (24 sections, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Spatial Distribution of Monitoring Station Locations Within the OpenAQ Dataset for All Air Pollutants in 2022. The map reveals a high density of stations within Europe and the US relative to the geographical areas these regions cover. This underscores the disparity between countries regarding the extent of their monitoring station networks and the availability of data to address air pollution challenges.
  • Figure 2: 100m U Component of Wind From the Meteorological Dataset Family.
  • Figure 3: Absorbing Aerosol Index From Sentinel 5P for the Remote Sensing Dataset Family.
  • Figure 4: Biogenic Emissions Example for the Emissions Dataset Family.
  • Figure 5: Model Predictions for a Well-Performing (EEA Spain 4327) and a Poor-Performing (CAAQM 8171) Monitoring Station During the Baseline Experiment. The model performs poorly for CAAQM 8171, but the data themselves raise questions about their validity: the NO$_2$ concentrations are recorded only as integers and never exceed 2 µg/m$^3$, which is highly unlikely given the industrial location of Nandesari, India. In contrast, the model performs very well for EEA Spain 4327, whose data appear accurate. This highlights the contrast in data quality within the dataset and the importance of performing a baseline experiment rather than simply assuming that all monitoring stations report accurate data.
  • ...and 10 more figures