Smooth Flow Matching

Jianbin Tan, Anru R. Zhang

TL;DR

This work introduces a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data, and applies it to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database.

Abstract

Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
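To make the notion of a "flow" concrete, the following is a minimal sketch of a one-dimensional continuous normalizing flow at a single time point: a base draw is transported to a target marginal by integrating a vector field. It is not the SFM estimator. The target parameters `MU` and `SIGMA`, the function names, and the closed-form velocity are assumptions chosen for illustration. In flow matching the velocity would be learned by regression, whereas here it is written down analytically so the example stays self-contained.

```python
import numpy as np

MU, SIGMA = 2.0, 0.5  # hypothetical target marginal N(MU, SIGMA^2)

def velocity(x, s):
    """Velocity along the straight-line path x_s = (1 - s) * z + s * (MU + SIGMA * z)."""
    scale = 1.0 - s + s * SIGMA      # so that x_s = scale * z + s * MU
    z = (x - s * MU) / scale         # recover the base point lying on this path
    return MU + (SIGMA - 1.0) * z    # derivative of x_s with respect to s

def push_forward(z, n_steps=200):
    """Integrate dx/ds = velocity(x, s) from s = 0 to s = 1 with forward Euler."""
    x = np.array(z, dtype=float)
    ds = 1.0 / n_steps
    for k in range(n_steps):
        x = x + ds * velocity(x, k * ds)
    return x

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)        # base draws from N(0, 1)
x = push_forward(z)                  # transported draws
```

Because the induced map from `z` to `x` is strictly increasing, the ranks of the base draws are preserved, which is the property a copula-type construction relies on.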

Paper Structure

This paper contains 33 sections, 11 theorems, 145 equations, 8 figures, 4 tables, and 3 algorithms.

Key Result

Proposition 1

Assume $X(t)$ and $Z(t)$ are continuous random variables, $t\in \mathcal{T}$. $X(\cdot)$ is a copula process with a base $Z(\cdot)$ if and only if there exists a family of continuous and strictly increasing functions $\{g_t\colon {\rm supp}(Z(t)) \to {\rm supp}(X(t)) \}_{t \in \mathcal{T}}$ such tha
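
A minimal sketch of the kind of construction Proposition 1 describes, under the assumption that the copula process is obtained by applying $g_t$ pointwise to the base, i.e. $X(t) = g_t(Z(t))$, with a Gaussian base as in the Gaussian copula process example. The squared-exponential covariance, the length scale, and the particular family `g` are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def sample_gaussian_base(t, n_samples, length_scale=0.2, seed=0):
    """Draw paths of a zero-mean Gaussian process Z(t) on the grid t.

    The squared-exponential covariance is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    diff = t[:, None] - t[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    # A small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(t)))
    return rng.standard_normal((n_samples, len(t))) @ chol.T

def g(t, z):
    """Hypothetical family g_t: continuous and strictly increasing in z for each t."""
    return np.exp(0.5 * z) + np.sin(2.0 * np.pi * t)

t = np.linspace(0.0, 1.0, 50)
Z = sample_gaussian_base(t, n_samples=200)   # base paths, shape (200, 50)
X = g(t[None, :], Z)                         # transformed paths, shape (200, 50)
```

At every grid point the transformed marginal is non-Gaussian (a shifted log-normal here), while the ordering of the subjects matches that of the base draws, so the dependence structure across time is inherited from `Z`.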

Figures (8)

  • Figure 1: A pictorial illustration of functional data generation via a three-dimensional flow. The left panel shows the base functions fed into the flow, while the right panel displays the output functions obtained by passing the base functions through the flow transformation.
  • Figure 2: Illustration of the data fitting and generation process. Samples from different subjects are shown in different colors, and arrows indicate the sample matching across time in the training process.
  • Figure 3: Left panel: Box plots of Wasserstein distances from 100 simulation replications under varying sample sizes $n$ (subtitles) and numbers of observed time points $J_i$ (main titles). Right panel: average computation times.
  • Figure 4: Illustration of generated functional data under different numbers of observed time points (main titles) and different generation methods (subtitles). The sample size is set to $n = 100$ for data generation.
  • Figure 5: (A) Irregularly observed functional data for each clinical feature; (B) Synthetic functional data generated by SFM with Gaussian bases; (C) Synthetic functional data generated by SFM with Student-$t$ bases; (D) Synthetic functional data generated by DSM; (E) Synthetic functional data generated by FM.
  • ...and 3 more figures

Theorems & Definitions (28)

  • Definition 1: Copula Process
  • Example 1: Gaussian Copula Process
  • Proposition 1: An Equivalent Condition for Copula Processes
  • Definition 2: Vector Field for Continuous Normalizing Flow
  • Theorem 1: Existence and Uniqueness of Solution to \ref{flow_def}
  • Remark 1: Flow Models via Continuous Normalizing Flows
  • Remark 2: Continuous Normalizing Flows for Functional Data Generation
  • Theorem 2: Smooth Function Generation via Smooth Flow
  • Example 2: Gaussian Bases with Smooth Covariance Functions
  • Theorem 3
  • ...and 18 more