Joint Modeling of Longitudinal EHR Data with Shared Random Effects for Informative Visiting and Observation Processes

Cheng-Han Yang, Xu Shi, Bhramar Mukherjee

TL;DR

A unified semiparametric joint modeling framework is proposed that simultaneously characterizes the visiting, biomarker observation, and longitudinal outcome processes. Central to the framework is a shared subject-specific Gaussian latent variable that captures unmeasured frailty and induces dependence across all components.

Abstract

Longitudinal electronic health record (EHR) data offer opportunities to study biomarker trajectories; however, association estimates (the primary inferential target) from standard models designed for regular observation times may be biased by a two-stage hierarchical missingness mechanism. The first stage is the visiting process (informative presence), where encounters occur at irregular times driven by patient health status; the second is the observation process (informative observation), where biomarkers are selectively measured during visits. To address these mechanisms, we propose a unified semiparametric joint modeling framework that simultaneously characterizes the visiting, biomarker observation, and longitudinal outcome processes. Central to this framework is a shared subject-specific Gaussian latent variable that captures unmeasured frailty and induces dependence across all components. We develop a three-stage estimation procedure and establish the consistency and asymptotic normality of our estimators. We also introduce a sequential procedure that imputes missing biomarkers prior to adjusting for irregular visiting and examine its performance. Simulation results demonstrate that our method yields unbiased estimates under this mechanism, whereas existing approaches can be substantially biased; notably, methods adjusting only for irregular visiting may exhibit even greater bias than those ignoring both mechanisms. We apply our framework to data from the All of Us Research Program to investigate associations between neighborhood-level socioeconomic status indicators and six blood-based biomarker trajectories, providing a robust tool for outpatient settings where irregular monitoring and selective measurement are prevalent.
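
To make the two-stage mechanism concrete, the following is a minimal simulation sketch, not the paper's simulation design. Every functional form and numeric value here is an assumption chosen for illustration: a constant baseline visit rate, a log-linear dependence of the visit rate on the latent variable, a probit measurement model, a linear outcome model, and the hypothetical names `simulate_patient`, `x`, and `sigma_u`. The only feature taken from the abstract is that one subject-specific Gaussian latent variable enters the visiting, observation, and outcome processes.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_patient(pid, rng, tau=5.0, sigma_u=0.7):
    """Return long-format rows for one patient (all coefficients are illustrative)."""
    x = 1 if rng.random() < 0.5 else 0   # hypothetical binary covariate
    u = rng.gauss(0.0, sigma_u)          # shared Gaussian latent variable U_i

    # Stage 1, informative presence: gaps between visits are exponential,
    # with a rate that increases with U_i (assumed log-linear form).
    rate = 1.5 * math.exp(0.3 * x + 0.5 * u)

    rows = []
    t = rng.expovariate(rate)
    while t < tau:
        # Stage 2, informative observation: at each visit the biomarker is
        # measured with a probit probability that also depends on U_i.
        measured = rng.random() < phi(-0.2 + 0.4 * x + 0.8 * u)
        # Longitudinal outcome: linear in x and t, shifted by the same U_i.
        y = 1.0 + 0.5 * x + 0.2 * t + u + rng.gauss(0.0, 0.5)
        rows.append({
            "id": pid, "t": t, "x": x, "u": u,
            "R": int(measured),
            "Y": y if measured else None,  # None plays the role of "NA"
        })
        t += rng.expovariate(rate)
    return rows


rng = random.Random(2024)
rows = [row for pid in range(2000) for row in simulate_patient(pid, rng)]

# U_i has mean zero in the population, but not among the measured rows:
# subjects with larger U_i visit more often and are measured more often.
measured_u = [row["u"] for row in rows if row["R"] == 1]
mean_u_measured = sum(measured_u) / len(measured_u)
print(f"rows: {len(rows)}, measured: {len(measured_u)}")
print(f"mean of U among measured rows: {mean_u_measured:.3f}")
```

Under these assumed settings the latent variable is over-represented among measured rows, which is why an outcome model fitted only to the observed measurements can give biased association estimates. The `u` column is kept purely as a diagnostic; in real EHR data it is unobserved.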

Paper Structure

This paper contains 72 sections, 18 theorems, 166 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Theorem 4.1

Under Assumptions asmp:censoring--asmp:distributions and regularity conditions (C1)--(C6), let $(\widehat{\bm\beta},\widehat{\bm\theta})$ be the solution to the estimating equations eq:EE_outcome, where the conditional expectations involving $U_i$ are evaluated using the Laplace approximation. Then,

Figures (8)

  • Figure 1: Illustration of the hierarchical data generation process involving Informative Presence (IP) and Informative Observation (IO). Left panel: Patient timelines where clinic visits (ticks) are generated by the visiting process driven by covariates $\bm{X}_i^{\mathcal{V}}$. At each visit, the observation process (driven by $\bm{X}_i^{\mathcal{O}}(t)$) determines whether the biomarker outcome $Y_i(t)$ is measured (solid dots, $R_i^{\mathcal{Y}}(t)=1$) or unmeasured (hollow circles, $R_i^{\mathcal{Y}}(t)=0$), while the underlying longitudinal biomarker trajectory is driven by $\bm{X}_i^{\mathcal{Y}}(t)$. Right panel: The resulting long-format dataset used for analysis, where "NA" in the $Y_i(t)$ column indicates an unmeasured outcome ($R_i^{\mathcal{Y}}(t)=0$) despite the patient's presence at the clinic.
  • Figure 2: Three-stage estimation procedure. Gray shaded boxes indicate latent information transmitted across stages. Stages 1 and 2 estimate the nuisance parameters needed for the empirical Bayes posterior and the marginalized observation probabilities, respectively; these quantities are subsequently incorporated into Stage 3 to correct for clinically informed bias.
  • Figure 3: Evaluation of $\beta_F$ estimator performance across Setting A (Scenarios A.1--A.4). Top: Empirical bias of $\widehat{\beta}_F$ (dashed line at 0). Bottom: RMSE of $\widehat{\beta}_F$. Boxplots summarize the distributions across replicates. Estimators are grouped by modeling approach and distinguished by color: Outcome-only (green), IP-only (blue), imputation+IP (orange), and IP+IO (red).
  • Figure 4: Evaluation of $\beta_F$ estimator performance across Setting B (Scenarios B.1--B.6). Top: Empirical bias of $\widehat{\beta}_F$ (dashed line at 0). Bottom: RMSE of $\widehat{\beta}_F$. Boxplots summarize the distributions across replicates. Estimators are grouped by modeling approach and distinguished by color: Outcome-only (green), IP-only (blue), imputation+IP (orange), and IP+IO (red).
  • Figure 5: Evaluation of $\beta_F$ estimator performance across Setting C (Scenarios C.1--C.6). Top: Empirical bias of $\widehat{\beta}_F$ (dashed line at 0). Bottom: RMSE of $\widehat{\beta}_F$. Boxplots summarize the distributions across replicates. Estimators are grouped by modeling approach and distinguished by color: Outcome-only (green), IP-only (blue), imputation+IP (orange), and IP+IO (red).
  • ...and 3 more figures

Theorems & Definitions (39)

  • Remark 3.1
  • Theorem 4.1: Consistency
  • Theorem 4.2: Asymptotic normality
  • Lemma S1: NHPP order-statistics identity
  • proof
  • Remark S2.1: Independence from the frailty
  • Lemma S2: Martingale compensation
  • proof
  • Lemma S3: Probit-normal convolution
  • proof
  • ...and 29 more
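
The listing gives only the title of Lemma S3, not its statement. The standard identity usually called the probit-normal convolution is that, for $U \sim N(0, \sigma^2)$, $E[\Phi(a + bU)] = \Phi\big(a / \sqrt{1 + b^2\sigma^2}\big)$; that the lemma takes exactly this form is an assumption. An identity of this kind would allow a probit observation probability to be marginalized over a Gaussian latent variable in closed form. The sketch below checks the identity numerically; the function names and the values of `a`, `b`, and `sigma` are hypothetical.

```python
import math


def phi_cdf(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(u, sigma):
    """Density of N(0, sigma^2) at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def marginal_probit_numeric(a, b, sigma, n=4001, width=10.0):
    """Trapezoid-rule approximation of E[Phi(a + b*U)] for U ~ N(0, sigma^2)."""
    lo, hi = -width * sigma, width * sigma
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        u = lo + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * phi_cdf(a + b * u) * normal_pdf(u, sigma)
    return total * h


def marginal_probit_closed(a, b, sigma):
    """Closed form: Phi(a / sqrt(1 + b^2 * sigma^2))."""
    return phi_cdf(a / math.sqrt(1.0 + (b * sigma) ** 2))


a, b, sigma = -0.2, 0.8, 0.7   # illustrative values only
numeric = marginal_probit_numeric(a, b, sigma)
closed = marginal_probit_closed(a, b, sigma)
print(f"numeric: {numeric:.8f}, closed form: {closed:.8f}")
```

The closed form avoids a one-dimensional integral per subject and per visit, which matters when the marginal probability has to be evaluated at every row of a large long-format dataset.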