Missing Data Imputation Based on Dynamically Adaptable Structural Equation Modeling with Self-Attention
Ou Deng, Qun Jin
TL;DR
SESA addresses missing data in electronic health records by integrating a dynamically adaptable structural equation model with a self-attention module for imputation, framed to maximize mutual information between observed and missing values. It combines SEM-based latent structure with a transformer-style attention mechanism, using FIML for initialization and joint optimization via a composite loss that includes MSE, covariance preservation, and sparsity. Causal discovery via NOTEARS guides SEM initialization, and experiments on a CDC BRFSS-derived dataset show that SESA outperforms a range of baselines across sample sizes and variable types, with robust performance and meaningful improvements in imputation quality. This work advances healthcare data analytics by uniting statistical modeling and deep representation learning, offering a scalable and adaptable approach with potential extensions to nonlinear and temporal causal modeling.
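The composite objective mentioned above (MSE, covariance preservation, and sparsity) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `composite_loss`, the weights `lam_cov` and `lam_l1`, the squared Frobenius form of the covariance term, and the L1 penalty on an SEM coefficient matrix `B` are all assumptions made for illustration. The reference covariance `sigma_ref` stands in for whatever estimate (for example, one obtained via FIML) the model is asked to preserve.

```python
import numpy as np

def composite_loss(x_ref, x_hat, eval_mask, sigma_ref, B,
                   lam_cov=0.1, lam_l1=0.01):
    """Hypothetical composite imputation loss (illustrative only).

    x_ref     : (n, d) reference values
    x_hat     : (n, d) imputed / reconstructed values
    eval_mask : (n, d) boolean, True where the MSE is evaluated (must be non-empty)
    sigma_ref : (d, d) reference covariance to preserve (assumed given)
    B         : (d, d) SEM coefficient matrix (assumed parameterization)
    """
    # Reconstruction term: mean squared error over the masked entries.
    diff = (x_hat - x_ref)[eval_mask]
    mse = float(np.mean(diff ** 2))

    # Covariance preservation: squared Frobenius distance between the
    # covariance of the imputed data and the reference covariance.
    sigma_hat = np.cov(x_hat, rowvar=False)
    cov_term = float(np.sum((sigma_hat - sigma_ref) ** 2))

    # Sparsity: L1 norm of the SEM coefficients, encouraging few active paths.
    l1 = float(np.sum(np.abs(B)))

    return mse + lam_cov * cov_term + lam_l1 * l1
```

In this sketch the three terms are simply summed with fixed weights. How the terms are actually weighted and optimized jointly with the attention module is not specified in this summary.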
Abstract
Handling missing data in complex datasets such as electronic health records (EHR) is critical for accurate analysis and decision-making in healthcare. This paper proposes SESA, a dynamically adaptable structural equation modeling (SEM) approach with self-attention, for data imputation in EHR. SESA extends traditional SEM-based methods by incorporating self-attention mechanisms, which improves model adaptability and accuracy across diverse EHR datasets. This allows SESA to adjust and optimize imputation dynamically, overcoming the limitations of static SEM frameworks. Our experimental analyses demonstrate that SESA achieves robust predictive performance in handling missing data in EHR. Moreover, the SESA architecture not only corrects potential misspecifications in SEM but also works with causal discovery algorithms to refine its imputation logic based on the underlying data structure. These features highlight its capabilities and its potential for broader application in EHR data analysis and beyond, marking a meaningful step forward in data imputation.
