
MAPLE: Metadata Augmented Private Language Evolution

Eli Chien, Yuzheng Hu, Ryan McKenna, Shanshan Wu, Zheng Xu, Peter Kairouz

Abstract

While differentially private (DP) fine-tuning of large language models (LLMs) is a powerful tool, it is often computationally prohibitive or infeasible when state-of-the-art models are only accessible via proprietary APIs. In such settings, generating DP synthetic data has emerged as a crucial alternative, offering the added benefits of arbitrary reuse across downstream tasks and transparent exploratory data analysis without the opaque constraints of a model's parameter space. Private Evolution (PE) is a promising API-based framework for this goal; however, its performance critically depends on initialization. When the private data distribution deviates substantially from the foundation model's pre-training priors--particularly in highly specialized domains--PE frequently struggles to align with the target data, resulting in degraded utility, poor convergence, and inefficient API usage. To address this initialization bottleneck, we propose Metadata Augmented Private Language Evolution (MAPLE). MAPLE leverages differentially private tabular metadata extraction and in-context learning to effectively ground the initial synthetic distribution in the target domain. Extensive experiments on challenging, domain-specific text generation tasks demonstrate that MAPLE achieves a significantly more favorable privacy-utility trade-off, converges faster, and drastically reduces API costs compared to previous PE methods.
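
As a rough illustration of how metadata and in-context examples could ground the initial prompt, the sketch below assembles a prompt from one DP-generated metadata record and a few (metadata, text) example pairs. The function name `build_random_api_prompt`, the instruction wording, and the `Metadata:` / `Text:` layout are assumptions made for illustration; the paper's actual prompt template is not reproduced here.

```python
def build_random_api_prompt(metadata, examples):
    """Compose a few-shot prompt from a metadata record and example pairs.

    metadata: dict of attribute name -> value (assumed to come from a DP
              metadata generator; hypothetical format).
    examples: list of (metadata_dict, text) pairs used as in-context examples.
    """
    def fmt(record):
        # Render a metadata record as "key: value; key: value".
        return "; ".join(f"{key}: {value}" for key, value in record.items())

    # Hypothetical instruction line; not the wording used by the authors.
    lines = ["Write a text that matches the given metadata.", ""]
    for example_metadata, example_text in examples:
        lines += [f"Metadata: {fmt(example_metadata)}", f"Text: {example_text}", ""]
    # The final record is left open so the model completes the text.
    lines += [f"Metadata: {fmt(metadata)}", "Text:"]
    return "\n".join(lines)
```

Each call yields one prompt, so drawing many metadata records from the DP generator would yield a correspondingly varied initial synthetic set.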

Paper Structure (12 sections, 1 equation, 5 figures)

Figures (5)

  • Figure 1: Illustration of AugPE and its limitations. Top: Overview of AugPE, adapted from Xie et al. (2024). AugPE first uses RANDOM_API to generate synthetic samples with a data-independent prompt (Step 1). It then iteratively refines these synthetic samples toward the private samples. At each iteration, AugPE builds a differentially private nearest-neighbor voting histogram in an embedding space and selects the synthetic samples that receive the most votes from private samples (Steps 2.1 and 2.2). The selected synthetic samples are then paraphrased via VARIATION_API to produce new synthetic samples (Step 2.3). This process is repeated for multiple rounds. Bottom: When the initial distribution induced by RANDOM_API is poorly aligned with the private samples, AugPE may require many iterations to reach the region of the private data distribution. To address this limitation, we propose incorporating metadata into the RANDOM_API prompt, making the initialization data-dependent and better aligned for the PE process. Details on differentially private metadata extraction are given in the appendix and in Figure 2.
  • Figure 2: Overview of MAPLE. We first extract metadata in tabular format, either following a designed schema or adapting the metadata provided with the dataset when it is available. We then train a lightweight (CPU-only) DP metadata generator using a state-of-the-art approach such as AIM (McKenna et al., 2022). Next, we compose the prompt for RANDOM_API from both the DP metadata and a few donated (metadata, text) pairs that serve as in-context examples. As the ablation study in the experiments section demonstrates, this in-context example design is crucial for fully leveraging the in-context learning capability of LLMs. After initialization, we refine the synthetic samples with PE (i.e., AugPE) to obtain the final DP synthetic dataset.
  • Figure 3: Main results. Top: Biorxiv datasets. The utility metrics are the MAUVE score, the average Jensen-Shannon divergence (JSD) over all metadata attributes as annotated by a powerful LLM, and the NTP accuracy of a bert-small model trained on the synthetic data. Bottom: OpenReview datasets. The utility metrics are the downstream prediction accuracies of a roberta-base model trained on the synthetic data, following the same setting as Xie et al. (2024), where the predicted labels are the area and rating of the review. An asterisk ($*$) marks results taken directly from Xie et al. (2024).
  • Figure 4: Ablation study on the Biorxiv dataset. +M: leveraging only metadata in the prompt. +E: leveraging only in-context examples in the prompt. MAPLE: leveraging both in the prompt. 0-shot: direct prompting without PE iterations.
  • Figure 5: How metadata richness affects PE convergence, measured by MAUVE score on the Biorxiv dataset. Weak M uses only two metadata attributes from the full set.
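
The voting-and-selection step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the AugPE implementation: the function names, the use of squared Euclidean distance, the Gaussian noise with a caller-supplied `sigma`, and the clipping of negative counts are all assumptions, and calibrating `sigma` to a target privacy budget is left out.

```python
import numpy as np


def dp_nn_histogram(private_emb, synthetic_emb, sigma, rng):
    """Noisy nearest-neighbor voting histogram over synthetic samples.

    private_emb:   array of shape (n_private, d)
    synthetic_emb: array of shape (n_synthetic, d)
    sigma:         Gaussian noise scale (assumed to be calibrated elsewhere)
    rng:           numpy.random.Generator
    """
    # Pairwise squared Euclidean distances, shape (n_private, n_synthetic).
    diffs = private_emb[:, None, :] - synthetic_emb[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Each private sample casts one vote for its nearest synthetic sample.
    votes = np.argmin(dists, axis=1)
    hist = np.bincount(votes, minlength=len(synthetic_emb)).astype(float)
    # Adding or removing one private sample changes one bin by 1, so
    # Gaussian noise on the counts is the privacy mechanism assumed here.
    noisy = hist + rng.normal(0.0, sigma, size=hist.shape)
    # Clip negative counts so the histogram can be used as sampling weights.
    return np.clip(noisy, 0.0, None)


def select_by_votes(noisy_hist, num_selected, rng):
    """Sample synthetic-sample indices in proportion to their noisy votes."""
    total = noisy_hist.sum()
    if total <= 0:
        # Fall back to uniform sampling if every noisy count was clipped to 0.
        probs = np.full(len(noisy_hist), 1.0 / len(noisy_hist))
    else:
        probs = noisy_hist / total
    return rng.choice(len(noisy_hist), size=num_selected, replace=True, p=probs)
```

The selected indices would then be passed to a paraphrasing call (the role VARIATION_API plays in the caption). Under this selection rule, a poorly aligned initialization concentrates votes on a few samples that are merely the least distant, which is consistent with the slow convergence the caption describes.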

Theorems & Definitions (1)

  • Definition 2.1