Indoor Air Quality Dataset with Activities of Daily Living in Low to Middle-income Communities
Prasenjit Karmakar, Swadhin Pradhan, Sandip Chakraborty
TL;DR
This paper tackles the paucity of indoor air quality data in developing countries by presenting a large-scale, activity-contextualized IAQ dataset collected in India. The authors deploy a low-cost, multi-sensor platform (DALTON) across 30 sites in four regions over six months, coupled with real-time activity annotations via a speech-to-text app, and provide floor plans to study pollutant spread. The dataset comprises around 89.1 million pollutant samples and 3957 activity annotations, enabling analyses of source emission, ventilation effects, and floor-plan influences, as well as ML tasks like activity recognition and cooking-item classification with strong performance in controlled scenarios. Openly available under AGPL-3.0, the dataset supports data-driven indoor design, smart ventilation policies, and pollution-aware applications in LMIC contexts, with ongoing updates and community contributions.
Abstract
In recent years, indoor air pollution has posed a significant threat to our society, claiming over 3.2 million lives annually. Developing nations, such as India, are most affected, since lack of knowledge, inadequate regulation, and outdoor air pollution lead to severe daily exposure to pollutants. However, only a limited number of studies have attempted to understand how indoor air pollution affects developing countries like India. To address this gap, we present spatiotemporal measurements of air quality from 30 indoor sites, collected over six months spanning the summer and winter seasons. The sites span four geographic regions covering rural, suburban, and urban settings, representative of the typical low- to middle-income population in India. The dataset covers a variety of indoor environments (e.g., studio apartments, classrooms, research laboratories, food canteens, and residential households) and can serve as a basis for research on data-driven learning models that address the pollution patterns unique to developing countries. Because power failures and network outages during data collection left gaps in the measurements, the dataset calls for advanced data cleaning and imputation techniques. Furthermore, through a simple speech-to-text application, we provide real-time indoor activity labels annotated by occupants. Environmentalists and ML enthusiasts can use this dataset to understand the complex patterns of pollutants under different indoor activities, identify recurring sources of pollution, forecast exposure, improve the floor plans and room structures of modern indoor designs, and develop pollution-aware recommender systems, among other applications.
