SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains

Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field

TL;DR

SynthTextEval addresses privacy-sensitive text data in high-stakes domains by providing an open-source toolkit to generate synthetic text with optional differential privacy guarantees (via DP-SGD) and to evaluate it with a unified metric suite across utility, fairness, privacy, and quality. The approach standardizes comparisons using $D_{\epsilon=\infty}$ and $D_{\epsilon=8}$ synthetic data, and metrics such as downstream task performance, equalized odds (EO), equality difference (ED), canary-based leakage, Entity Leakage Percentage (ELP), MAUVE, and Fréchet Inception Distance (FID). Validation on TAB and healthcare/MIMIC-i2b2 case studies demonstrates typical privacy-utility trade-offs and the value of standardized auditing in regulated settings. The toolkit is domain-agnostic and extensible, enabling rigorous privacy-preserving synthetic-text development and auditing to support trustworthy AI in sensitive environments.
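To make one of the fairness metrics named above concrete, here is a minimal sketch of an equalized-odds gap for a binary downstream classifier, using the standard textbook definition (the largest between-group difference in true-positive or false-positive rate). The function names `equalized_odds_gap` and `_positive_rate` are hypothetical and are not taken from the SynthTextEval codebase; the toolkit's own implementation may differ.

```python
from typing import Hashable, Sequence


def _positive_rate(
    preds: Sequence[int],
    labels: Sequence[int],
    groups: Sequence[Hashable],
    group: Hashable,
    label_value: int,
) -> float:
    """Share of positive predictions among members of `group` whose true label is `label_value`."""
    selected = [p for p, y, g in zip(preds, labels, groups) if g == group and y == label_value]
    if not selected:
        raise ValueError(f"group {group!r} has no examples with label {label_value}")
    return sum(selected) / len(selected)


def equalized_odds_gap(
    preds: Sequence[int],
    labels: Sequence[int],
    groups: Sequence[Hashable],
) -> float:
    """Largest between-group gap in TPR or FPR; 0.0 means equalized odds holds exactly."""
    names = sorted(set(groups), key=repr)
    gaps = []
    for label_value in (1, 0):  # label 1 gives the TPR gap, label 0 gives the FPR gap
        rates = [_positive_rate(preds, labels, groups, g, label_value) for g in names]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


# Toy data (illustrative only, not from the paper): two groups of four examples each.
example_preds = [1, 0, 1, 1, 0, 0, 1, 0]
example_labels = [1, 1, 0, 0, 1, 1, 0, 0]
example_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
example_gap = equalized_odds_gap(example_preds, example_labels, example_groups)
```

In a privacy-utility comparison, one would compute this gap for a classifier trained on real data and for classifiers trained on each synthetic dataset, then compare the values side by side.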

Abstract

We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerous applications, such as reducing the risks of privacy violations in the development and deployment of AI systems in high-stakes domains. Realizing this potential, however, requires principled, consistent evaluations of synthetic data across multiple dimensions: its utility in downstream systems, the fairness of these systems, the risk of privacy leakage, general distributional differences from the source text, and qualitative feedback from domain experts. SynthTextEval allows users to conduct evaluations along all of these dimensions over synthetic data that they upload or generate using the toolkit's generation module. While our toolkit can be run over any data, we highlight its functionality and effectiveness over datasets from two high-stakes domains: healthcare and law. By consolidating and standardizing evaluation metrics, we aim to improve the viability of synthetic text and, in turn, privacy preservation in AI development.

Paper Structure

This paper contains 25 sections, 8 equations, 3 figures, and 8 tables.

Figures (3)

  • Figure 1: Architecture overview of SynthTextEval.
  • Figure 2: Our visual interface supporting exploration, comparison, and annotation of synthetic and real text.
  • Figure 3: Memorization of private entities in the TAB dataset as context window length increases.