Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction

Yahia Dalbah, Marcel Worring, Yen-Chia Hsu

TL;DR

The paper tackles scalable, real-time air quality monitoring with low-cost sensors (LCS) by eliminating the need for co-located reference stations. It introduces Veli, a reference-free, unsupervised Bayesian model that learns a latent representation to separate the true pollutant signal $y$ from sensor noise in $x_{\rm noise}$ by optimizing a variational objective (the ELBO) with latent variable $z$ and auxiliary data $\psi$. A new benchmark, AQ-SDR, aggregates readings from 23,737 LCS and reference stations across regions and years to standardize the evaluation of LCS correction methods. Experiments show substantial MAE reductions in-distribution and strong generalization to out-of-distribution data, with further gains from fine-tuning on new regions and the ability to quantify uncertainty via credible intervals. Together, these contributions enable dense, robust, and scalable AQ monitoring and establish a standardized dataset to accelerate future AQ sensing research.

Abstract

Urban air pollution is a major health crisis causing millions of premature deaths annually, underscoring the urgent need for accurate and scalable monitoring of air quality (AQ). While low-cost sensors (LCS) offer a scalable alternative to expensive reference-grade stations, their readings are affected by drift, calibration errors, and environmental interference. To address these challenges, we introduce Veli (Reference-free Variational Estimation via Latent Inference), an unsupervised Bayesian model that leverages variational inference to correct LCS readings without requiring co-location with reference stations, eliminating a major deployment barrier. Specifically, Veli constructs a disentangled representation of the LCS readings, effectively separating the true pollutant reading from the sensor noise. To build our model and address the lack of standardized benchmarks in AQ monitoring, we also introduce the Air Quality Sensor Data Repository (AQ-SDR). AQ-SDR is the largest AQ sensor benchmark to date, with readings from 23,737 LCS and reference stations across multiple regions. Veli demonstrates strong generalization across both in-distribution and out-of-distribution settings, effectively handling sensor drift and erratic sensor behavior. Code for model and dataset will be made public when this paper is published.

Paper Structure

This paper contains 37 sections, 23 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: A snapshot from the AQ-SDR dashboard of sensors in the city of Utrecht in the Netherlands. The query area shows the results of applying our method, Veli, on hourly noisy readings from deployed LCS over four days.
  • Figure 2: Probability Density Function (PDF) of PM$_{2.5}$ readings from an LCS device co-located next to a reference station over 3 years. The PDF of the LCS readings matches the reference in the first year of deployment, then shows significant drift over the next two years, unlike the well-maintained reference station that exhibits consistent behavior.
  • Figure 3: Veli structure following the derivation in Section \ref{subsection:lcsnoisemodel}. The input consists of the AQ readings $x$ and an auxiliary mask of 'NA' readings $\psi$ on the left, which propagate through the model's layers to generate a prediction of clean readings $\hat{y}$. Conditioning on $\psi$ is omitted from some blocks for visual clarity but is applied in the implementation. Prior distribution blocks (green) are used during training to estimate the variational distribution blocks (blue), which are used at inference, as indicated by the blue dashed line. All distribution blocks are modeled by two multilayer perceptron (MLP) layers followed by one MLP layer each for the mean and the variance. The losses $\mathcal{L}_{KL_z}$, $\mathcal{L}_{KL_y}$, and $\mathcal{L}_{recon}$ correspond to the three terms in Eq. \ref{eq:final_loss}. Sampling refers to the standard reparameterization used in VAEs \cite{vae}.
  • Figure 4: PDF comparison of in-distribution and out-of-distribution data. Readings from the Netherlands are concentrated toward the left (lower values), indicating lower pollution levels, in contrast to the readings from Taiwan, which reflect higher levels of pollution.
  • Figure 5: 12-hour averages of Utrecht's data over two months. The raw LCS readings deviate significantly from the reference readings. Veli takes these readings as input and outputs a corrected measurement that closely matches the reference readings. The region between the red dashed lines is shown zoomed in, in Figure \ref{fig:hourly-avgs-utrecht}.
  • ...and 10 more figures
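
The distribution blocks described in the Figure 3 caption (two MLP layers followed by one layer each for the mean and the variance, with reparameterized sampling and a KL term) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, layer sizes, ReLU activation, log-variance parameterization, and the choice to condition the prior block on $\psi$ only are all assumptions made for illustration.

```python
import numpy as np


def init_block(rng, in_dim, hidden_dim, out_dim):
    """Hypothetical parameters: two shared layers plus mean and log-variance heads."""
    def layer(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "h1": layer(in_dim, hidden_dim),
        "h2": layer(hidden_dim, hidden_dim),
        "mean": layer(hidden_dim, out_dim),
        "logvar": layer(hidden_dim, out_dim),
    }


def gaussian_block(params, inputs):
    """Map inputs to the mean and log-variance of a diagonal Gaussian."""
    hidden = inputs
    for name in ("h1", "h2"):
        weight, bias = params[name]
        hidden = np.maximum(hidden @ weight + bias, 0.0)  # ReLU (assumed)
    mean_w, mean_b = params["mean"]
    logvar_w, logvar_b = params["logvar"]
    # Log-variance is predicted instead of variance for numerical stability (assumed).
    return hidden @ mean_w + mean_b, hidden @ logvar_w + logvar_b


def reparameterize(rng, mean, logvar):
    """Draw a sample as mean + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * logvar) * eps


def kl_diag_gaussians(mean_q, logvar_q, mean_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    ratio = (np.exp(logvar_q) + (mean_q - mean_p) ** 2) / np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + ratio - 1.0, axis=-1)


# Toy usage: 4 samples with 8 readings each; psi flags 'NA' readings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
psi = (rng.random((4, 8)) < 0.1).astype(float)

q_params = init_block(rng, in_dim=16, hidden_dim=32, out_dim=3)  # variational block on (x, psi)
p_params = init_block(rng, in_dim=8, hidden_dim=32, out_dim=3)   # prior block on psi only (assumed)

q_mean, q_logvar = gaussian_block(q_params, np.concatenate([x, psi], axis=-1))
p_mean, p_logvar = gaussian_block(p_params, psi)

z = reparameterize(rng, q_mean, q_logvar)                    # latent sample per row
kl_z = kl_diag_gaussians(q_mean, q_logvar, p_mean, p_logvar)  # one KL value per row
```

In this sketch `kl_z` plays the role of a single KL term such as $\mathcal{L}_{KL_z}$; the second KL term and the reconstruction term would be built from further blocks of the same shape, whose exact wiring is given by the paper's derivation rather than by this example.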