Stable and Privacy-Preserving Synthetic Educational Data with Empirical Marginals: A Copula-Based Approach

Gabriel Diaz Ramos, Lorenzo Luzi, Debshila Basu Mallick, Richard Baraniuk

Abstract

To advance Educational Data Mining (EDM) within strict privacy-protecting regulatory frameworks, researchers must develop methods that enable data-driven analysis while protecting sensitive student information. Synthetic data generation is one such approach, enabling the release of statistically generated samples instead of real student records; however, existing deep learning and parametric generators often distort marginal distributions and degrade under iterative regeneration, leading to distribution drift and progressive loss of distributional support that compromise reliability. In response, we introduce the Non-Parametric Gaussian Copula (NPGC), a plug-and-play synthesis method that replaces deep learning and parametric optimization with empirical statistical anchoring to preserve the observed marginal distributions while modeling dependencies through a copula framework. NPGC integrates Differential Privacy (DP) at both the marginal and correlation levels, supports heterogeneous variable types, and treats missing data as an explicit state to retain informative absence patterns. We evaluate NPGC against deep learning and parametric baselines on five benchmark datasets and demonstrate that it remains stable across multiple regeneration cycles and achieves competitive downstream performance at substantially lower computational cost. We further validate NPGC through deployment in a real-world online learning platform, demonstrating its practicality for privacy-preserving research.

Paper Structure

This paper contains 29 sections, 29 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Marginal density of the Age variable in the Adult dataset. The shaded region denotes the empirical density of the real data. Curves correspond to synthetic samples generated by a deep learning model, a Parametric Gaussian Copula (PGC), a Copula-GAN hybrid, and our proposed Non-Parametric Gaussian Copula (NPGC). Existing methods exhibit visible deviations from the empirical marginal, whereas NPGC aligns closely with the real distribution.
  • Figure 2: Marginal density of the Capital-gain variable in the Adult dataset across three sequential regeneration iterations of a Tabular Variational Autoencoder (TVAE). The shaded region denotes the empirical density of the original data, and red curves (iterations 1–3) represent synthetic samples generated under repeated synthetic feedback. The progressive concentration of probability mass indicates variance collapse and loss of distributional support across iterations.
  • Figure 3: Overview of NPGC. FIT: Empirical marginal estimation via empirical Cumulative Distribution Functions (CDFs) with Gaussian projection, followed by correlation estimation $\widehat{R}$ with positive semi-definite (PSD) correction; the collection of marginals $\{\widehat{F}_j\}_{j=1}^p$, where $p$ denotes the number of features (shown as $\widehat{F}$ in the figure for clarity), and $\widehat{R}$ are stored. SAMPLE: Independent Gaussian samples are correlated using Cholesky factorization $LL^{\top}$ of $\widehat{R}$ and mapped back through inverse empirical CDF reconstruction.
  • Figure 4: Marginal density of the Admission grade variable in the Student Dropout and Academic Success dataset. The shaded region denotes the empirical density of the real data. Curves correspond to TGAN, TVAE, PGC, Copula-GAN, and the proposed NPGC. Deep learning and parametric baselines distort the marginal shape, while NPGC preserves it through empirical anchoring.
  • Figure 5: Marginal density of the Education-num variable in the Adult dataset across three sequential regeneration iterations of NPGC. The shaded region denotes the empirical density of the original data, and purple curves (iterations 1–3) represent synthetic samples generated under repeated regeneration. The overlap across iterations demonstrates stability of the marginal distribution.
  • ...and 2 more figures
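
The FIT/SAMPLE pipeline described in the Figure 3 caption (empirical CDFs with Gaussian projection, correlation estimation with PSD correction, Cholesky-correlated sampling, inverse empirical CDF reconstruction) can be sketched as follows. This is a minimal illustration of the general non-parametric Gaussian copula idea, not the authors' implementation; the function names `fit_npgc` and `sample_npgc`, the rank-based CDF estimate, and the eigenvalue-clipping PSD correction are assumptions for the sake of the sketch, and it handles only continuous columns (the paper's NPGC additionally supports heterogeneous types, missingness states, and DP noise).

```python
import numpy as np
from scipy import stats

def fit_npgc(X):
    """FIT: empirical marginals via rank-based CDFs with Gaussian projection,
    then correlation estimation with a PSD correction (eigenvalue clipping)."""
    n, p = X.shape
    # Empirical CDF values via ranks, shifted into (0, 1) to keep ppf finite
    U = (np.argsort(np.argsort(X, axis=0), axis=0) + 0.5) / n
    Z = stats.norm.ppf(U)                      # Gaussian projection of each marginal
    R = np.corrcoef(Z, rowvar=False)           # correlation in Gaussian space
    # PSD correction: clip negative eigenvalues, renormalize to unit diagonal
    w, V = np.linalg.eigh(R)
    R = V @ np.diag(np.clip(w, 1e-10, None)) @ V.T
    d = np.sqrt(np.diag(R))
    R = R / np.outer(d, d)
    # Store sorted observed values as the inverse empirical CDF lookup table
    marginals = np.sort(X, axis=0)
    return marginals, R

def sample_npgc(marginals, R, m, seed=None):
    """SAMPLE: correlate independent Gaussians via the Cholesky factor of R,
    then map back through the inverse empirical CDFs."""
    rng = np.random.default_rng(seed)
    n, p = marginals.shape
    L = np.linalg.cholesky(R)
    Z = rng.standard_normal((m, p)) @ L.T      # correlated standard Gaussians
    U = stats.norm.cdf(Z)                      # back to uniform marginals
    idx = np.clip((U * n).astype(int), 0, n - 1)
    return marginals[idx, np.arange(p)]        # inverse empirical CDF per column
```

Because the inverse empirical CDF only ever returns observed values, the synthetic marginals match the empirical ones by construction, which is the mechanism behind the regeneration stability shown in Figure 5.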