Imputation Matters: A Deeper Look into an Overlooked Step in Longitudinal Health and Behavior Sensing Research

Akshat Choube, Rahul Majethia, Sohini Bhattacharya, Vedant Das Swain, Jiachen Li, Varun Mishra

TL;DR

This paper addresses the overlooked problem of missing data in longitudinal passive health sensing by combining formative interviews with a case study on six GLOBEM datasets. It demonstrates that the choice of imputation strategy can substantially alter study outcomes, with Autoencoder-based imputation achieving up to a 31% improvement in AUROC for depression prediction. The work introduces two novel imputation approaches and benchmarks them against existing baselines, showing meaningful gains in reconstruction quality and predictive performance, including robust real-time imputation capabilities. The findings advocate for integrating systematic, within-person imputation evaluation into the data pipeline, and the authors provide open-source tools to enable broader adoption and replication in longitudinal health research.

Abstract

Longitudinal passive sensing studies for health and behavior outcomes often have missing and incomplete data. Handling missing data effectively is thus a critical data processing and modeling step. Our formative interviews with researchers working in longitudinal health and behavior passive sensing revealed a recurring theme: most researchers consider imputation a low-priority step in their analysis and inference pipeline, opting to use simple, off-the-shelf imputation strategies without comprehensively evaluating their impact on study outcomes. Through this paper, we call attention to the importance of imputation. Using publicly available passive sensing datasets for depression, we show that prioritizing imputation can significantly impact the study outcomes: our proposed imputation strategies yield up to a 31% improvement in AUROC for predicting depression over the original imputation strategy. We conclude by discussing the challenges and opportunities of effective imputation in longitudinal sensing studies.

Paper Structure

This paper contains 34 sections, 2 equations, 5 figures, 4 tables, 3 algorithms.

Figures (5)

  • Figure 1: Density distributions of selected features from different feature categories (steps, location, screen, sleep, and Bluetooth) for six GLOBEM datasets. The density distributions of most features tend to vary across datasets. Features from the sleep category are more similar, as expected.
  • Figure 2: Comparison of reconstruction RMSE across imputation strategies for the Reorder feature set.
  • Figure 3: Comparison of reconstruction RMSE across imputation strategies for the Chikersal feature set.
  • Figure 4: Histograms and fitted Gaussians of per-participant balanced accuracies in GLOBEM datasets when using Autoencoder-Median and GLOBEM-R as imputation strategies for the Reorder algorithm.
  • Figure 5: Balanced accuracy comparison of various algorithms for real-time inductive imputation with the Reorder algorithm.