Air Pollution Forecasting in Bucharest
Dragoş-Andrei Şerban, Răzvan-Alexandru Smădu, Dumitru-Clementin Cercel
TL;DR
The paper tackles PM2.5 forecasting in Bucharest by evaluating a broad spectrum of models, from traditional linear and ensemble methods to deep learning, transformers, and LLM-based approaches, across multiple horizons (1, 2, and 4 hours, with some 8-hour cases). It introduces a Bucharest-specific dataset with extensive pollutant and meteorological features, built after comprehensive preprocessing that includes outlier handling via FBEWMA and lag-feature engineering. The study finds that transformer-based models generally provide the best predictive performance, with advanced RNNs and hybrid architectures also performing well, while LLMs with RAG offer limited improvements. Limitations include reliance on a single measurement station and the absence of exogenous data such as traffic, suggesting future work with multi-station data and richer external features to further improve forecasts and capture seasonality and spatial variability.
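The two preprocessing steps named above can be illustrated with a minimal sketch. It assumes that FBEWMA denotes a forward-backward exponentially weighted moving average, that outliers are points deviating from the smoothed curve by more than a multiple of the series' standard deviation, and that flagged points are refilled by linear interpolation. The function names, the column name `pm25`, and the values of `span`, `n_sigmas`, `lags`, and `horizons` are illustrative assumptions, not the settings used in the paper.

```python
import pandas as pd


def fbewma(series: pd.Series, span: int = 5) -> pd.Series:
    """Mean of a forward EWMA and a backward EWMA (assumed meaning of FBEWMA)."""
    forward = series.ewm(span=span).mean()
    # Run the same filter on the reversed series, then restore the original order.
    backward = series.iloc[::-1].ewm(span=span).mean().iloc[::-1]
    return (forward + backward) / 2.0


def replace_outliers(series: pd.Series, span: int = 5, n_sigmas: float = 3.0) -> pd.Series:
    """Mask points far from the FBEWMA curve and refill them by linear interpolation."""
    residual = (series - fbewma(series, span)).abs()
    threshold = n_sigmas * series.std()  # assumed rule; other thresholds are possible
    cleaned = series.mask(residual > threshold)
    return cleaned.interpolate(limit_direction="both")


def add_lags_and_targets(df: pd.DataFrame, column: str = "pm25",
                         lags=(1, 2, 3), horizons=(1, 2, 4)) -> pd.DataFrame:
    """Add lagged inputs and future targets, one target column per forecast horizon."""
    out = df.copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out[column].shift(lag)  # value `lag` steps earlier
    for h in horizons:
        out[f"{column}_h{h}"] = out[column].shift(-h)  # value `h` steps ahead
    # Rows at the edges lack either a lag or a target and are dropped.
    return out.dropna()
```

With hourly data, each `pm25_h*` column would serve as the regression target for the corresponding horizon, while the `pm25_lag*` columns act as inputs alongside any other pollutant and meteorological features.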
Abstract
Air pollution, especially particulate matter 2.5 (PM2.5), has become a growing concern in recent years, primarily in urban areas. Exposure to air pollution is linked to numerous health problems, including the aggravation of respiratory diseases, cardiovascular disorders, lung function impairment, and even cancer or early death. Forecasting PM2.5 levels has therefore become increasingly important, as it can provide early warnings and help prevent disease. This paper aims to design, fine-tune, test, and evaluate machine learning models for predicting future levels of PM2.5 over various time horizons. Our primary objective is to assess and compare the performance of multiple models on this forecasting task, ranging from linear regression algorithms and ensemble-based methods to deep learning models, such as advanced recurrent neural networks and transformers, as well as large language models.
