A comparison between geostatistical and machine learning models for spatio-temporal prediction of PM2.5 data

Zeinab Mohamed, Wenlong Gong

TL;DR

This study uses the extensive data from PurpleAir sensors to assess and compare the effectiveness of various statistical and machine learning models in producing accurate hourly PM2.5 maps across California, and improves the predictive accuracy of PM2.5 concentration by correcting the bias in PurpleAir data with an ensemble model.

Abstract

Ambient air pollution poses significant health and environmental challenges. Exposure to high concentrations of PM$_{2.5}$ has been linked to increased respiratory and cardiovascular hospital admissions, more emergency department visits, and deaths. Traditional air quality monitoring systems, such as EPA-certified stations, provide limited spatial and temporal data. The advent of low-cost sensors has dramatically improved the granularity of air quality data, enabling real-time, high-resolution monitoring. This study uses the extensive data from PurpleAir sensors to assess and compare the effectiveness of various statistical and machine learning models in producing accurate hourly PM$_{2.5}$ maps across California. We evaluate traditional geostatistical methods, including kriging and land use regression, against advanced machine learning approaches such as neural networks, random forests, and support vector machines, as well as an ensemble model. We enhance the predictive accuracy of PM$_{2.5}$ concentration by correcting the bias in PurpleAir data with an ensemble model that incorporates both spatiotemporal dependencies and machine learning models.

Paper Structure

This paper contains 20 sections, 24 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: PurpleAir sensor counts in the US from 2017 to 2022.
  • Figure 2: Real-time map of PM$_{2.5}$ concentration measured by PurpleAir sensors in 2019 (left) and 2023 (right), reproduced from the PurpleAir public visualization interface [purpleair]. Marker colors indicate air quality categories: (green) satisfactory; (yellow) acceptable; (orange) members of sensitive groups may be affected; (red) the general public may experience health effects; (purple) increased health risk for everyone. Marker sizes reflect visualization rendering and display resolution in the original interface and do not represent the number or density of installed sensors.
  • Figure 3: PM$_{2.5}$ concentration on the log scale on May $18^{th}$, 2019 at 2 pm.
  • Figure 4: Illustration of a Multi-layer Neural Network with $N$ hidden layers.
  • Figure 5: A comparison of all models in the non-geostatistical group with respect to root mean square error (RMSE), symmetric mean absolute percentage error (SMAPE), mean absolute deviation (MAD), and the correlation (Cor) between observed and predicted values.
  • ...and 2 more figures
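
The four scores named in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code. It assumes that MAD means the mean absolute difference between observed and predicted values, that SMAPE follows one common convention (the mean of 2|o - p| / (|o| + |p|), in percent), and that Cor is the Pearson correlation. The function names and the sample numbers are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def smape(obs, pred):
    """Symmetric MAPE in percent (assumed convention: 2|o-p| / (|o|+|p|))."""
    terms = []
    for o, p in zip(obs, pred):
        denom = abs(o) + abs(p)
        # Treat a 0/0 pair as a perfect prediction rather than dividing by zero.
        terms.append(0.0 if denom == 0 else 2 * abs(o - p) / denom)
    return 100 * sum(terms) / len(terms)

def mad(obs, pred):
    """Mean absolute deviation of predictions from observations (assumed meaning)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def cor(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    return cov / (sd_o * sd_p)

# Hypothetical hourly PM2.5 values (micrograms per cubic meter), for illustration only.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

scores = {
    "RMSE": rmse(observed, predicted),
    "SMAPE": smape(observed, predicted),
    "MAD": mad(observed, predicted),
    "Cor": cor(observed, predicted),
}
```

Lower RMSE, SMAPE, and MAD indicate a closer fit, while Cor closer to 1 indicates stronger linear agreement. SMAPE has several competing definitions, so scores are only comparable when computed under the same convention.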