Latent variable model for high-dimensional point process with structured missingness
Maksim Sinelnikov, Manuel Haussmann, Harri Lähdesmäki
TL;DR
This work tackles high-dimensional longitudinal data with structured missingness and irregular sampling by proposing two probabilistic latent-variable models that combine Gaussian process (GP) priors with deep encoders and decoders. The core idea is to model three interconnected latent streams: observations $z^y$, missingness masks $z^m$, and an auxiliary time process $z^{\lambda}$ via a temporal point process, with a structured additive GP kernel to capture covariate interactions. The two variants, LLSM and LLPPSM, differ in their use of a temporal point process and of its intensity as an input to the GP kernels, which enables improved imputation and future prediction under missing-not-at-random (MNAR) and missing-at-random (MAR) conditions. Evaluation on HealthMNIST variants and PhysioNet ICU data shows competitive or superior performance across missingness patterns, highlighting the method’s potential for robust representation learning and downstream analysis in healthcare and other longitudinal domains.
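As a rough illustration of what a structured additive GP kernel over irregularly sampled time points can look like, the sketch below sums a shared time component and a time-by-group interaction component, then draws one latent trajectory from the resulting GP prior. It is not the authors' kernel: the function names, the choice of squared-exponential and categorical components, the lengthscales, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D arrays a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def categorical_kernel(a, b):
    """1 where the categorical covariate matches, 0 otherwise."""
    return (a[:, None] == b[None, :]).astype(float)

def additive_kernel(t, group):
    """Shared time component plus a time-by-group interaction component."""
    k_time = rbf_kernel(t, t, lengthscale=2.0)
    # The elementwise product of two valid kernels is again a valid kernel.
    k_interaction = categorical_kernel(group, group) * rbf_kernel(t, t, lengthscale=1.0)
    return k_time + k_interaction

# Hypothetical irregular measurement times for two groups of samples.
t = np.array([0.0, 0.7, 2.5, 0.2, 1.9, 4.1])
group = np.array([0, 0, 0, 1, 1, 1])

K = additive_kernel(t, group)

# Draw one latent trajectory from the zero-mean GP prior
# (a small jitter on the diagonal keeps the covariance numerically stable).
rng = np.random.default_rng(0)
z = rng.multivariate_normal(np.zeros(len(t)), K + 1e-8 * np.eye(len(t)))
```

Samples that share a group are correlated through both components, while samples from different groups are correlated only through the shared time component.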
Abstract
Longitudinal data are important in numerous fields, such as healthcare, sociology and seismology, but real-world datasets present notable challenges for practitioners: they can be high-dimensional, can contain structured missingness patterns, and can have measurement time points governed by an unknown stochastic process. While various solutions have been suggested, most of them account for only one of these challenges. In this work, we propose a flexible and efficient latent-variable model that addresses all of these challenges. Our approach uses Gaussian processes to capture temporal correlations between samples and their associated missingness masks, as well as to model the underlying point process. We construct our model as a variational autoencoder whose encoder and decoder are parameterised by deep neural networks, and develop a scalable amortised variational inference approach for efficient model training. We demonstrate competitive performance on both simulated and real datasets.
