Cyclic Adaptive Private Synthesis for Sharing Real-World Data in Education

Hibiki Ito, Chia-Yu Hsu, Hiroaki Ogata

TL;DR

The paper addresses the challenge of sharing educational real-world data (RWD) under $(\varepsilon,\delta)$-differential privacy (DP) guarantees. It introduces the Cyclic Adaptive Private Synthesis (CAPS) framework, which uses a two-component variational autoencoder (M1 unconditional, M2 conditional) trained with semi-private semi-supervised learning to enable iterative, DP-compliant data release across cohorts. In a case study with three years of K-12 learning-habits data, CAPS shows rising downstream utility and reconstruction quality over cycles compared to a one-shot baseline, while revealing a compounding bias in conditional generation and discussing privacy accounting limitations. The work demonstrates a practical path toward open science and design-based research in learning analytics while highlighting ethical and methodological caveats that require further investigation.

Abstract

The rapid adoption of digital technologies has greatly increased the volume of real-world data (RWD) in education. While these data offer significant opportunities for advancing learning analytics (LA), secondary use for research is constrained by privacy concerns. Differentially private synthetic data generation is regarded as the gold-standard approach to sharing sensitive data, yet studies on the private synthesis of educational data remain scarce and rely predominantly on large, low-dimensional open datasets. Educational RWD, however, are typically high-dimensional and small in sample size, leaving the potential of private synthesis underexplored. Moreover, because educational practice is inherently iterative, data sharing is continual rather than one-off, making a traditional one-shot synthesis approach suboptimal. To address these challenges, we propose the Cyclic Adaptive Private Synthesis (CAPS) framework and evaluate it on authentic RWD. By iteratively sharing RWD, CAPS not only fosters open science but also offers rich opportunities for design-based research (DBR), thereby amplifying the impact of LA. Our case study using actual RWD demonstrates that CAPS outperforms a one-shot baseline while highlighting challenges that warrant further investigation. Overall, this work offers a crucial first step towards privacy-preserving sharing of educational RWD and expands the possibilities for open science and DBR in LA.

Paper Structure (24 sections, 1 theorem, 10 equations, 3 figures, 2 tables)

Key Result

proposition 1

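If an algorithm ${\mathcal{A}}$ satisfies $(\varepsilon, \delta)$-DP, then for any (possibly randomized) mapping $\mathrm{Proc}$ that is applied to its output without further access to the private data, the post-processed algorithm $\mathrm{Proc}\circ{\mathcal{A}}$ also satisfies $(\varepsilon, \delta)$-DP.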
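
The post-processing property can be illustrated with a minimal sketch. It is not the paper's training procedure: it swaps in the classical Gaussian mechanism on a bounded mean as a simple stand-in for ${\mathcal{A}}$, and the function names (`gaussian_sigma`, `private_mean`, `post_process`), the score range, and the sample values are all assumptions made for illustration. The point is only that `post_process` touches the noisy output and public bounds, never the private values, so it leaves the $(\varepsilon, \delta)$ guarantee unchanged.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism.

    The classical calibration is only valid for 0 < epsilon < 1.
    """
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("classical Gaussian mechanism needs 0 < epsilon, delta < 1")
    return math.sqrt(2 * math.log(1.25 / delta)) * sensitivity / epsilon


def private_mean(values, lower, upper, epsilon, delta, rng):
    """Hypothetical mechanism A: an (epsilon, delta)-DP mean of bounded values.

    Assumes the dataset size n is public and neighbouring datasets differ
    by replacing one record, so the mean has sensitivity (upper - lower) / n.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sigma = gaussian_sigma((upper - lower) / n, epsilon, delta)
    return sum(clipped) / n + rng.gauss(0.0, sigma)


def post_process(noisy_mean, lower, upper):
    """Hypothetical Proc: clamp to the public range and round.

    Uses only the mechanism's output and public constants, so by the
    post-processing property the release stays (epsilon, delta)-DP.
    """
    return round(min(max(noisy_mean, lower), upper), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [62.0, 71.5, 88.0, 45.0, 93.5, 77.0]  # made-up private values
    noisy = private_mean(scores, 0.0, 100.0, epsilon=0.5, delta=1e-5, rng=rng)
    print("released:", post_process(noisy, 0.0, 100.0))
```

In the setting described by the figure captions, the analogous reading would be that data sampled from a DP-trained generative model is a post-processing of that model, which is why either the synthetic data or the model itself can be shared under the same guarantee.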

Figures (3)

  • Figure 1: Overview of the proposed CAPS framework. $D_t$ are private datasets for cycles $t=1,2,\dots$ which we wish to share with third parties. The generative model M1+M2 is trained by semi-private semi-supervised learning (SPSSL) to share the synthetic data or the model itself under DP guarantee.
  • Figure 2: Performance of generative models in academic achievement prediction for different privacy parameters and cycles within the CAPS framework. The shaded areas indicate 95% confidence intervals. $\varepsilon=\infty$ is the non-DP baseline.
  • Figure 3: AJS divergence (defined in the paper's equation labeled eq:ajs) between real and synthetic data (a) reconstructed from the real data and (b) conditionally generated by sampling from the prior. The shaded areas indicate 95% confidence intervals. $\varepsilon=\infty$ is the non-DP baseline.

Theorems & Definitions (3)

  • definition 1: Differential privacy [Dwork2006approxDP]
  • proposition 1: Post-processing [Dwork2014foundations]
  • definition 2: Public data
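
For reference, the standard statement of $(\varepsilon, \delta)$-DP that definition 1 refers to can be written as below. The symbols $D$, $D'$, and $S$ are notation assumed here and may differ from the paper's own; the paper's definition of public data is not reproduced because it is not given in this summary.

```latex
% A randomized algorithm \mathcal{A} is (\varepsilon, \delta)-DP if,
% for all neighbouring datasets D, D' (differing in one record)
% and all measurable output sets S:
\Pr\bigl[\mathcal{A}(D) \in S\bigr]
  \;\le\; e^{\varepsilon}\, \Pr\bigl[\mathcal{A}(D') \in S\bigr] + \delta
```

Smaller $\varepsilon$ and $\delta$ mean a stronger guarantee, and $\varepsilon=\infty$ places no constraint, which matches its use as the non-DP baseline in the figures.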