Dirichlet kernel density estimation on the simplex with missing data

Hanen Daayeb, Wissem Jedidi, Salah Khardani, Guanjie Lyu, Frédéric Ouimet

Abstract

Nonparametric density estimation for compositional data supported on the simplex is examined under a missing at random mechanism. Rather than imputing missing values and estimating the density from a completed data set, we adopt a strategy based on inverse probability weighting. The proposed estimator uses an adaptive Dirichlet kernel, which ensures nonnegativity on the simplex and favorable behavior near the boundary. When the observation probabilities are unknown, they are estimated through a Nadaraya-Watson regression step. The large-sample properties of the estimator are derived, including pointwise bias and variance expansions, optimal smoothing rates, and asymptotic normality. A simulation study investigates its finite-sample performance under varying sample sizes and missing rates. Simulations show our method outperforms inverse-probability-weighted kernel density estimators based on additive and isometric log-ratio transformations of the data for certain target densities. The methodology is further illustrated through an application to leukocyte composition data from the National Health and Nutrition Examination Survey (NHANES), which allows for the identification of the modal immune profile in the sampled population.
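The estimator described in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard Dirichlet kernel parametrization $(\boldsymbol{s}/b + 1, (1 - \|\boldsymbol{s}\|_1)/b + 1)$, a hypothetical fully observed scalar covariate `y` driving the MAR mechanism, and a Gaussian kernel with bandwidth `h` for the Nadaraya-Watson step. The function names, the simulated data, and the bandwidth values are all illustrative choices.

```python
import numpy as np
from scipy.special import gammaln


def dirichlet_kernel(s, x, b):
    """Dirichlet density with parameters (s/b + 1, (1 - sum(s))/b + 1) at rows of x.

    s: evaluation point, first d coordinates of a composition, shape (d,).
    x: data points in the same convention, shape (n, d), in the simplex interior.
    """
    alpha = np.append(s, 1.0 - s.sum()) / b + 1.0        # d + 1 shape parameters
    x_full = np.column_stack([x, 1.0 - x.sum(axis=1)])   # append the last part
    log_norm = gammaln(alpha.sum()) - gammaln(alpha).sum()
    return np.exp(log_norm + ((alpha - 1.0) * np.log(x_full)).sum(axis=1))


def nadaraya_watson(y_eval, y, delta, h):
    """Gaussian-kernel Nadaraya-Watson estimate of P(delta = 1 | Y = y_eval)."""
    w = np.exp(-0.5 * ((y_eval[:, None] - y[None, :]) / h) ** 2)
    return (w @ delta) / w.sum(axis=1)


def ipw_dirichlet_kde(s, x, delta, pi_hat, b):
    """Inverse-probability-weighted Dirichlet KDE at s (assumed form)."""
    obs = delta == 1                                     # only observed rows contribute
    k = dirichlet_kernel(s, x[obs], b)
    return np.sum(k / pi_hat[obs]) / len(delta)


# Illustrative simulated data (hypothetical, not one of the paper's models).
rng = np.random.default_rng(0)
n = 500
x = rng.dirichlet([2.0, 3.0, 4.0], size=n)[:, :2]       # compositions with 3 parts
y = rng.uniform(size=n)                                  # always-observed covariate
pi_true = 0.6 + 0.35 * y                                 # assumed MAR mechanism
delta = (rng.uniform(size=n) < pi_true).astype(float)    # 1 = x observed, 0 = missing

pi_hat = nadaraya_watson(y, y, delta, h=0.1)
s0 = np.array([0.2, 0.3])
f_hat = ipw_dirichlet_kde(s0, x, delta, pi_hat, b=0.05)
print(f"IPW Dirichlet KDE at {s0}: {f_hat:.3f}")
```

In this sketch, rows with `delta == 0` never enter the kernel sum, so their `x` values could equally be stored as NaN. Because each evaluation is itself a Dirichlet density in the data argument, the estimate is nonnegative by construction, which is the property the abstract highlights.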

Paper Structure (26 sections, 10 theorems, 135 equations, 9 figures, 2 tables)

Key Result

Proposition 4.1

Suppose that Assumptions ass:1, ass:3, and ass:5 hold. Uniformly for $\boldsymbol{s}\in \mathcal{S}_d$, we have where

Figures (9)

  • Figure 1: Visualization of the MAR mechanisms for Model I (left panel) and Model II (right panel).
  • Figure 2: Contour plots of the Model I target density $f$ (left panel) and the associated Dirichlet kernel density estimate $\hat{f}_{n,0.05}$ (right panel), with a sample size $n = 2000$ and a $10\%$ missing rate.
  • Figure 3: Contour plots of the Model II target density $f$ (left panel) and the associated Dirichlet kernel density estimate $\hat{f}_{n,0.05}$ (right panel), with a sample size $n = 2000$ and a $10\%$ missing rate.
  • Figure 4: Mean, median, standard deviation, and interquartile range of $1000$ ISEs in Model I for the IPW Dirichlet KDE as a function of the proportion of missing data, shown for four sample sizes $n\in \{100, 200, 400, 800\}$.
  • Figure 5: Mean, median, standard deviation, and interquartile range of $1000$ ISEs in Model II for the IPW Dirichlet KDE as a function of the proportion of missing data, shown for four sample sizes $n\in \{100, 200, 400, 800\}$.
  • ...and 4 more figures

Theorems & Definitions (13)

  • Proposition 4.1: Pointwise bias
  • Proposition 4.2: Pointwise variance
  • Corollary 4.3: Mean squared error
  • Theorem 4.4: Asymptotic normality
  • Proposition 4.5: Pointwise bias
  • Proposition 4.6: Pointwise variance
  • Corollary 4.7: Mean squared error
  • Theorem 4.8: Asymptotic normality
  • Remark 1
  • Lemma 9.1
  • ...and 3 more