Training-Free Private Synthesis with Validation: A New Frontier for Practical Educational Data Sharing

Hibiki Ito, Chia-Yu Hsu, Hiroaki Ogata

Abstract

While secondary use of real-world data (RWD) in education offers substantial research opportunities, data sharing is often limited by privacy constraints. Differentially private synthetic data generation (DP-SDG) has emerged as a possible solution. However, educational RWD is fragmented across platforms and institutions and stored in different formats, so DP-SDG must be tailored to each dataset, requiring substantial engineering effort. In addition, such data are often small-sample and high-dimensional, making deep learning (DL)-based methods common but difficult to implement without specialist expertise. In this setting, it is also hard to achieve practically useful downstream utility. As a result, despite its theoretical promise, DP-SDG remains far from a practical solution in education. To address this issue, we propose a more practical two-stage method: (1) training-free, LLM-based DP-SDG is performed for sharing synthetic data and (2) on-demand real-data validation, where researchers submit code for remote validation of results. This simple method is designed for individual data custodians without extensive DP-SDG expertise. It can also be adapted to multi-shot synthesis, where data from different learner cohorts are synthesised regularly. We evaluate this method experimentally in both the one-shot and multi-shot synthesis settings using RWD collected over three years and conduct a case study with real researchers. Results show that LLM-based DP-SDG performs comparably to a DL-based baseline while greatly reducing engineering costs, and that non-DP validation causes measurable but moderate privacy leakage. Nonetheless, in the case study researchers reported that on average only 36% of synthetic findings are validated on real data. Overall, the paper provides a practical method for sharing educational RWD, while highlighting challenges in risk mitigation and epistemic precision.

Paper Structure

This paper contains 30 sections, 3 theorems, 16 equations, 8 figures, and 2 tables.

Key Result

Proposition 1

Let ${\mathcal{A}}$ be an algorithm with domain ${\mathcal{D}}$, ${\mathbb{D}}$ be a distribution over ${\mathcal{D}}$, and $n \in {\mathbb{N}}$ be an arbitrary dataset size. Assume ${\mathcal{D}}$ is closed and bounded. Then for all $\alpha\in[0,1]$ there exists $(x^*, y_{x^*})\in{\mathcal{D}}$ such that …

Figures (8)

  • Figure 1: The proposed two-stage method in a multi-shot setting. $D_1,D_2,\dots$ are private datasets of different learner cohorts. The initial cycle corresponds to a standard one-shot setting. From the second cycle, researchers give feedback on how findings on synthetic and real data differed in the prior cycle.
  • Figure 2: Prompt template provided to LLMs to generate synthetic data. {{label}} indicates the academic achievement class (high, average, low) and {{zero props}} are calculated from real data with the Gaussian mechanism. When generating surrogate public data for the DL baseline, zero proportions are not given. When the two-stage method is adapted to CAPS, suggestions for improving synthesis from the previous cycle are appended to the prompt.
  • Figure 3: The average Jensen-Shannon divergence (AJS) between real and synthetic data for the one-shot synthesis by the baseline DL-based method and the proposed LLM-based method. Lower AJS indicates better fidelity.
  • Figure 4: Gains in the average Jensen-Shannon divergence (AJS) by adapting the DL-based baseline method and the proposed two-stage method to the multi-shot synthesis setting. Note that the DP guarantee of generated data for the two-stage method is only with respect to the real data of that year because the generation process uses non-DP validation results of the previous year.
  • Figure 5: The empirical $\mu$-GWMIP with respect to $n=120$ and the underlying distribution for the two-stage sharing at each request index in chronological order. $\mu$ at request $R$ accounts for DP-SDG and releasing the set of requests $1,\dots,R$. The red dashed lines indicate the privacy guarantee that DP-SDG alone provides without real-data validation (i.e. the case of Stage 1 only).
  • ...and 3 more figures
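The Figure 2 caption notes that the {{zero props}} values in the prompt are released via the Gaussian mechanism. A minimal sketch of that step is below; the function name, the noise scale `sigma`, and the clamping are illustrative assumptions (the paper does not specify its calibration), but the mechanism itself is standard: adding one record changes a zero-count by at most 1, so the proportion's sensitivity is $1/n$, and Gaussian noise scaled accordingly yields a DP release.

```python
import random


def dp_zero_proportion(column_values, sigma, rng=None):
    """Release the proportion of zero entries in a column with Gaussian noise.

    Hypothetical sketch of the Gaussian mechanism: the L2 sensitivity of the
    proportion is 1/n (one record moves the zero-count by at most 1), and
    `sigma` is the noise standard deviation in proportion units. Calibrating
    sigma to a target (epsilon, delta) or mu-GDP budget is out of scope here.
    """
    rng = rng or random.Random()
    n = len(column_values)
    true_prop = sum(1 for v in column_values if v == 0) / n
    noisy = true_prop + rng.gauss(0.0, sigma)
    # Clamping is post-processing, so it does not weaken the DP guarantee.
    return min(max(noisy, 0.0), 1.0)
```

With `sigma = 0` the release degenerates to the exact proportion, which makes the sketch easy to sanity-check before wiring in a real privacy budget.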

Theorems & Definitions (11)

  • Definition 1: Differential privacy Dwork2006approxDP
  • Definition 2: Trade-off functions Dong2022GDP
  • Definition 3: $f$-differential privacy Dong2022GDP
  • Definition 4: $\mu$-Gaussian differential privacy Dong2022GDP
  • Definition 5: Per-example MI game adopted from Ye2022MIA
  • Definition 6: $f_{n,{\mathbb{D}}}$-worst-case membership inference privacy Ito2024thesis
  • Definition 7: Gaussian worst-case membership inference privacy Ito2024thesis
  • Proposition 1: Existence of the worst-case example Ito2024thesis
  • Proposition 2
  • Proof
  • ...and 1 more
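Definitions 2–4 build on the trade-off-function view of privacy from Dong et al. (2022): a mechanism is $\mu$-Gaussian DP if its trade-off function is at least $G_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$, the smallest type-II error achievable at type-I error $\alpha$ when distinguishing $\mathcal{N}(0,1)$ from $\mathcal{N}(\mu,1)$. This known closed form can be evaluated with the standard-library normal distribution; the function name below is just illustrative.

```python
from statistics import NormalDist


def gdp_tradeoff(mu, alpha):
    """Trade-off function G_mu of mu-Gaussian DP (Dong et al., 2022).

    G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu): the minimal type-II error
    of any test with type-I error alpha that distinguishes N(0, 1) from
    N(mu, 1). Smaller mu means G_mu is closer to 1 - alpha (perfect privacy).
    """
    std = NormalDist()  # standard normal: cdf = Phi, inv_cdf = Phi^{-1}
    return std.cdf(std.inv_cdf(1.0 - alpha) - mu)
```

At $\mu = 0$ the two hypotheses coincide and $G_0(\alpha) = 1 - \alpha$, i.e. an adversary can do no better than random guessing; as $\mu$ grows (e.g. as more non-DP validation results accumulate across requests, as tracked in Figure 5), $G_\mu$ drops toward 0 and membership inference becomes easier.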