SpiroActive: Active Learning for Efficient Data Acquisition for Spirometry

Ankita Kumari Jain, Nitish Sharma, Madhav Kanda, Nipun Batra

TL;DR

This research proposes using active learning, a sub-field of machine learning, to mitigate the challenges associated with data collection and labeling. It presents evidence that models trained on small subsets obtained through active learning achieve results comparable to or better than models trained on the complete dataset.

Abstract

Respiratory illnesses are a significant global health burden. Among them, chronic obstructive pulmonary disease (COPD) is the seventh leading cause of poor health worldwide and the third leading cause of death, causing 3.23 million deaths in 2019, which makes early identification and diagnosis essential for effective mitigation. Among the diagnostic tools employed, spirometry plays a crucial role in detecting respiratory abnormalities. However, conventional clinical spirometry often entails considerable costs and practical limitations, such as the need for specialized equipment, trained personnel, and a dedicated clinical setting, making it less accessible. To address these challenges, wearable spirometry technologies have emerged as promising alternatives, offering accurate, cost-effective, and convenient solutions. Developing machine learning models for wearable spirometry relies heavily on high-quality ground-truth spirometry data, which is laborious and expensive to collect. In this research, we propose using active learning, a sub-field of machine learning, to mitigate the challenges associated with data collection and labeling. By strategically selecting which samples to measure with the ground-truth spirometer, we can reduce the need for resource-intensive data collection. We present evidence that models trained on small subsets obtained through active learning achieve results comparable to or better than models trained on the complete dataset.

Paper Structure

This paper contains 20 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Graphical representation of the flow-volume curve in spirometry, showing the three commonly used lung health indices: Peak Expiratory Flow (PEF), Forced Expiratory Volume in 1 second (FEV1), and Forced Vital Capacity (FVC).
  • Figure 2: Active learning loop, showing the flow of data across the whole pipeline.
  • Figure 3: MAPE scores for different train-test splits for the FVC task. The figure shows the importance of choosing a representative yet diverse split, especially in low-data settings.
  • Figure 4: Oracle vs. random sampling for the FVC dataset. The oracle beats random sampling by a wide margin, indicating the need for better acquisition strategies and intelligent models.
  • Figure 5: Active learning curves using Random Forest with standard deviation as the query strategy vs. random sampling. Active learning beats random sampling by a large margin while using fewer points.
  • ...and 5 more figures
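
The loop in Figure 2 and the standard-deviation query strategy in Figure 5 can be illustrated with a minimal pool-based sketch. This is not the authors' implementation: to stay dependency-free, a small bootstrap ensemble of least-squares lines stands in for the Random Forest, and the spread (standard deviation) of the ensemble members' predictions plays the role of the uncertainty score. The data are synthetic, and every name here (`run_active_learning`, `get_label`, `true_fvc`, the single input feature) is a hypothetical chosen for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line y = a*x + b (constant fit if all x are equal)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def fit_ensemble(xs, ys, n_models, rng):
    """Fit each ensemble member on a bootstrap resample of the labeled set."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def member_predictions(models, x):
    """One prediction per ensemble member for a single input x."""
    return [a * x + b for a, b in models]


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (y_true must be nonzero)."""
    return 100 * statistics.fmean(
        [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    )


def run_active_learning(pool_x, get_label, test_x, test_y,
                        n_init=3, n_queries=10, n_models=20, seed=0):
    rng = random.Random(seed)
    unlabeled = list(range(len(pool_x)))
    rng.shuffle(unlabeled)
    # Start from a few randomly chosen labeled points.
    labeled = [unlabeled.pop() for _ in range(n_init)]
    labels = {i: get_label(pool_x[i]) for i in labeled}
    history = []
    for _ in range(n_queries):
        models = fit_ensemble([pool_x[i] for i in labeled],
                              [labels[i] for i in labeled], n_models, rng)
        preds = [statistics.fmean(member_predictions(models, x)) for x in test_x]
        history.append(mape(test_y, preds))
        # Query the pool point where the ensemble members disagree most.
        pick = max(
            unlabeled,
            key=lambda i: statistics.pstdev(member_predictions(models, pool_x[i])),
        )
        unlabeled.remove(pick)
        labeled.append(pick)
        labels[pick] = get_label(pool_x[pick])  # costly ground-truth measurement
    return labeled, history


# Synthetic stand-in data (assumption): one feature, FVC-like target in liters.
data_rng = random.Random(42)
pool_x = [data_rng.uniform(0.0, 4.0) for _ in range(60)]


def true_fvc(x):
    return 3.0 + 0.5 * x


def get_label(x):
    # Hypothetical stand-in for taking a measurement with the clinical spirometer.
    return true_fvc(x) + data_rng.gauss(0.0, 0.1)


test_x = [0.5 * k for k in range(9)]
test_y = [true_fvc(x) for x in test_x]
labeled, history = run_active_learning(pool_x, get_label, test_x, test_y)
print(f"labeled {len(labeled)} of {len(pool_x)} pool points")
print("test MAPE after each round:", [round(h, 2) for h in history])
```

Replacing the `max(...)` selection with a random choice from `unlabeled` gives the random-sampling baseline that the figures compare against. With a real Random Forest, the same score would be computed from the standard deviation of the individual trees' predictions.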