A new approach to data assimilation initialization problems with sparse data using multiple cost functions

David J. Abers, George Hripcsak, Lena Mamykina, Melike Sirlanci, Esteban G. Tabak

Abstract

This article develops a novel data assimilation methodology that addresses challenges common in real-world settings, such as severe sparsity of observations, lack of reliable models, and non-stationarity of the system dynamics. These challenges often cause identifiability issues and can confound model parameter initialization, both of which can lead to estimated models with unrealistic qualitative dynamics and induce deeper parameter estimation errors. The proposed methodology's objective function is constructed as a sum of components, each serving a different purpose: enforcing point-wise and distribution-wise agreement between data and model output, enforcing agreement of variables and parameters with a provided model, and penalizing unrealistically rapid parameter changes unless they are due to external drivers or interventions. This methodology was motivated by, and developed and evaluated in the context of, estimating blood glucose levels in different medical settings. Both simulated and real data are used to evaluate the methodology from different perspectives, such as its ability to estimate unmeasured variables, its ability to reproduce the correct qualitative blood glucose dynamics, how it manages known non-stationarity, and how it performs when given a range of dense and severely sparse data. The results show that a multicomponent cost function can balance the minimization of point-wise errors with global properties, robustly preserving correct qualitative dynamics and managing data sparsity.
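
As a rough illustration of the "sum of components" structure described in the abstract, the following is a minimal sketch and not the authors' formulation. Every function name, the quantile-based distribution term, the scalar time-varying parameter `theta`, the one-step model `step`, and the `weights` are assumptions introduced here for illustration only.

```python
import numpy as np

def pointwise_cost(y_obs, x_model, obs_idx):
    """Mean squared mismatch between data and model output at observed times."""
    return float(np.mean((x_model[obs_idx] - y_obs) ** 2))

def distribution_cost(y_obs, x_model, n_quantiles=9):
    """Mismatch between empirical quantiles of the data and of the model
    trajectory (an assumed stand-in for distribution-wise agreement)."""
    q = np.linspace(0.1, 0.9, n_quantiles)
    return float(np.mean((np.quantile(x_model, q) - np.quantile(y_obs, q)) ** 2))

def model_cost(x_model, theta, step):
    """Squared residual against a hypothetical one-step model
    x[k+1] = step(x[k], theta[k])."""
    pred = step(x_model[:-1], theta[:-1])
    return float(np.mean((x_model[1:] - pred) ** 2))

def smoothness_cost(theta, intervention_mask):
    """Penalize parameter jumps, except at steps flagged as interventions.
    intervention_mask[k] is True when a known intervention occurs at step k."""
    jumps = np.diff(theta) ** 2
    allowed = ~intervention_mask[1:]  # zero weight where a jump is expected
    return float(np.mean(jumps * allowed))

def total_cost(y_obs, obs_idx, x_model, theta, step, intervention_mask,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four illustrative components."""
    parts = (
        pointwise_cost(y_obs, x_model, obs_idx),
        distribution_cost(y_obs, x_model),
        model_cost(x_model, theta, step),
        smoothness_cost(theta, intervention_mask),
    )
    return float(np.dot(weights, parts))
```

In this sketch the relative `weights` control the trade-off between fitting individual measurements and preserving global properties of the trajectory; how such weights are chosen in the paper is not stated in the abstract.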

Paper Structure

This paper contains 35 sections, 78 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Estimating and forecasting glucose trajectories for two patients in the ICU. The plot on the left shows accurate estimation and forecasting almost immediately ($<1$ day, $<6$ data points), while the plot on the right shows poor estimation and prediction until about day $5$ ($\sim 125$ data points). Factors affecting model estimation accuracy include a complex interplay between data sparsity, which leads to model identifiability problems (Figs. \ref{fig:motivation1}-\ref{fig:motivation2}), and model initialization.
  • Figure 1: Given the oscillatory ICU glycemic dynamics and the model basic state assuming oscillatory dynamics, we see the model estimating the simulated glucose measured according to $h_1$ (left), $h_2$ (center) and $h_3$ (right). Even for the sparse data cases ($h_1$, $h_2$), the model produces oscillatory dynamics with reasonable mean and amplitude, while for the densely measured case ($h_3$) the model tracks the data precisely. The point-wise estimates remain accurate in all cases. The dotted lines signal reconstruction for periods without data longer than a prescribed threshold.
  • Figure 2: In the ICU, insulin is never measured, even though it is one of the states that define the glucose-insulin system and should lie in the range of 25-400 picomoles per liter (pmol/l). This can lead to model estimation and initialization problems, as seen in Fig. \ref{fig:motivation0}. Here we see the estimated interstitial and plasma insulin levels that are driving the forecasting errors seen in Fig. \ref{fig:motivation0b}. Note that after about $5$ days the model does eventually entrain to the patient and the insulin estimates take physiologically plausible values.
  • Figure 2: Given the oscillatory ICU glycemic dynamics and the model basic state assuming oscillatory dynamics, we see the model estimating the simulated glucose measured according to $h_1$ (left), $h_2$ (center) and $h_3$ (right). Note that "BG measured" denotes the data available to the model when it is estimated, and "estimated BG" denotes the model-estimated invariant measure of the data. The densely measured case ($h_3$) is likely the closest representative of a gold standard baseline, again for data measured frequently in time. Even for the sparse data cases ($h_1$, $h_2$), the model produced an accurate representation of the invariant measure that was not particularly dependent on the sparse measurement function, while for the densely measured case ($h_3$) the model estimated all the details of the invariant measure well.
  • Figure 3: Patient $593$'s glucose trajectory (a) and insulin states (b) estimated with the constrained EnKF, the constrained parameter estimate trajectory (c), the percentage of particles violating the constraints per data point for estimated model states and parameters (d), and the individual ensemble particle trajectories for patient $593$'s estimated glucose trajectory (e).
  • ...and 8 more figures