Estimation of Functional Principal Components from Sparse Functional Data

Uche Mbaka, Jiguo Cao, Michelle Carey

Abstract

Sparse functional data arise when measurements are observed infrequently and at irregular time points for each subject, often in the presence of measurement error. These characteristics introduce additional challenges for functional principal component analysis. In this paper, we propose a new approach for extracting functional principal components from such data by combining basis expansion with maximum likelihood estimation. Orthogonality of the estimated eigenfunctions is preserved throughout the optimization using modified Gram-Schmidt orthonormalization. An information criterion is proposed to select both the optimal number of basis functions and the rank of the covariance structure. Principal component scores are subsequently estimated via conditional expectation, enabling accurate reconstruction of the underlying functional trajectories across the full domain despite sparse observations. Simulation studies demonstrate the effectiveness of the proposed method and show that it performs favorably compared with existing approaches. Its practical utility is illustrated through applications to CD4 cell count data from the Multicenter AIDS Cohort Study and somatic cell count data from Irish research dairy cattle. Supplementary materials, including technical details, additional simulation results, and the R package mGSFPCA, are available online.
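
The orthonormalization step named in the abstract can be illustrated with a minimal sketch of modified Gram-Schmidt. This is not the mGSFPCA implementation: the function name `modified_gram_schmidt` is hypothetical, and the sketch assumes the plain Euclidean inner product on coefficient vectors, whereas the paper's eigenfunctions may be orthonormalized under a different (for example, basis-weighted) inner product.

```python
import math


def modified_gram_schmidt(vectors):
    """Orthonormalize a list of equal-length vectors (lists of floats).

    Hypothetical helper for illustration only. Assumes the Euclidean
    inner product, which may differ from the inner product used in the paper.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # Modified variant: project against the running residual w,
            # not the original v, which is more stable in floating point.
            coef = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - coef * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < 1e-12:
            raise ValueError("vectors are (numerically) linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


# Example: three linearly independent coefficient vectors (made-up values).
coeffs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
ortho = modified_gram_schmidt(coeffs)
```

Re-applying such a step after each parameter update is one way to keep estimated components orthonormal during an iterative optimization; how the paper integrates it with the likelihood is described in the full text, not here.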

Paper Structure

This paper contains 18 sections, 17 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Spaghetti plot for sparsely recorded CD4 count data (the number of CD4 T lymphocytes in a sample of blood) for 6 subjects, each shown in a different color.
  • Figure 2: Simulated functional data with observed points $Y(t)$ (dots) and true curves $X(t)$ (lines). Left: The Matérn covariance with $m_i \sim U_{d}(5,15)$ and $\sigma^2 = 1$. Right: The Egg Crate covariance with $m_i \sim U_{d}(3,7)$ and $\sigma^2 = 0.25$.
  • Figure 3: True (solid lines) and estimated (dotted lines) principal eigenfunctions and sample trajectories for the cubic B-spline simulation.
  • Figure 4: Estimates of the mean, eigenfunctions, covariance surface, and sample subject trajectories of the CD4 cell count data. Top row left: Estimated mean (thick dark line) and sample trajectories (thin gray lines); Top row middle: Estimated eigenfunctions: $\hat{\phi}_1$ (solid line), $\hat{\phi}_2$ (dashed line), and $\hat{\phi}_3$ (dotted line); Top row right: Estimated covariance contour plot; Bottom row panels: Observations (circles), predicted trajectories (solid line), and 95% point-wise confidence bands (dashed lines) for selected subjects. These subjects correspond to those with the largest absolute scores for each of the three principal components, respectively.
  • Figure 5: Estimates of the mean, eigenfunctions, covariance surface, and sample subject trajectories of the somatic cell score data. Top row left: Estimated mean (thick dark line) and sample trajectories (thin gray lines); Top row middle: Estimated eigenfunctions: $\hat{\phi}_1$ (solid line), $\hat{\phi}_2$ (dashed line), and $\hat{\phi}_3$ (dotted line); Top row right: Estimated covariance contour plot; Bottom row panels: Observations (circles), predicted trajectories (solid line), and 95% point-wise confidence bands (dashed lines) for selected subjects. These subjects correspond to those with the largest absolute scores for each of the three principal components, respectively.