Auditing and Generating Synthetic Data with Controllable Trust Trade-offs

Brian Belgodere, Pierre Dognin, Adam Ivankay, Igor Melnyk, Youssef Mroueh, Aleksandra Mojsilovic, Jiri Navratil, Apoorva Nitsure, Inkit Padhi, Mattia Rigotti, Jerret Ross, Yair Schiff, Radhika Vedpathak, Richard A. Young

TL;DR

This work proposes a holistic auditing framework for synthetic data that jointly evaluates fidelity, privacy, utility, fairness, and robustness across modalities and data splits. It introduces a trustworthiness index built from per-dimension indices via copula-based aggregation and ECDF normalization, enabling context-specific ranking and cross-validated model selection through TrustFormers. The framework is demonstrated on tabular, time-series, NLP, and vision-like data, including healthcare (MIMIC-III) and fraud detection use cases, showing that carefully selected synthetic data can match or exceed real data in key trust dimensions while respecting privacy and fairness constraints. Overall, the approach provides transparent auditing reports and governance-ready workflows, offering practical tools for regulatory compliance and safer deployment of synthetic data pipelines.

Abstract

Real-world data often exhibits bias, imbalance, and privacy risks. Synthetic datasets have emerged to address these issues. This paradigm relies on generative AI models to generate unbiased, privacy-preserving data while maintaining fidelity to the original data. However, assessing the trustworthiness of synthetic datasets and models is a critical challenge. We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models. It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation. We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases like education, healthcare, banking, and human resources, spanning different data modalities such as tabular, time-series, vision, and natural language. This holistic assessment is essential for compliance with regulatory safeguards. We introduce a trustworthiness index to rank synthetic datasets based on their safeguard trade-offs. Furthermore, we present a trustworthiness-driven model selection and cross-validation process during training, exemplified with "TrustFormers" across various data types. This approach allows for controllable trustworthiness trade-offs in synthetic data creation. Our auditing framework fosters collaboration among stakeholders, including data scientists, governance experts, internal reviewers, external certifiers, and regulators. This transparent reporting should become a standard practice to prevent bias, discrimination, and privacy violations, ensuring compliance with policies and providing accountability, safety, and performance guarantees.

Paper Structure (60 sections, 19 equations, 16 figures, 26 tables, 1 algorithm)

Figures (16)

  • Figure 1: Summary diagram of our proposed holistic synthetic data auditing framework. For each trust dimension (fidelity, privacy, utility, fairness, and robustness), we evaluate multiple metrics on the synthetic data and quantify their uncertainty. Metrics are aggregated within each trust dimension, which results in trust dimension indices. These indices are re-weighted with desired trust trade-offs to produce the trustworthiness index. Different synthetic datasets are then ranked using this trustworthiness index, and a summary of the audit is written to an audit report. The ranking produced by our audit enables comparison of different synthetic data produced by various generative modeling techniques, and aids the model selection process for a given generation technique, allowing its alignment with prescribed safeguards. The model selection is performed via trustworthiness-index-driven cross-validation, which results in controllable trust trade-offs by producing new ranks for different desired weighting trade-offs for a given application and use case.
  • Figure 1: Auditing the trustworthiness of synthetic data: from metric evaluations on trust dimensions to the audit report.
  • Figure 2: Auditing Platform and workflows connecting different stakeholders (e.g., data scientists, data governance experts, internal reviewers, external certifiers, and regulators) from model development to audit and certification via a synthetic data auditing report.
  • Figure 2: Metric normalization via the empirical CDF transform.
  • Figure 3: Summary of auditing and ranking results on the Bank Marketing dataset using the trustworthiness index given in Eq. \ref{eq:selrulealpha} for $\alpha=0$. (a) and (b) show trust dimension indices $\pi_{T}$ (where "T" corresponds to Fidelity, Privacy, Utility, Fairness, or Robustness) and their "variance" ($\Delta_{T}$) on TrustFormer (TF) and baseline models. The format is $\pi_{T} (\Delta_{T}) ||$ Name of the synthetic data model. (c) shows the ranking of the models across different trustworthiness profiles $\omega$ given in Table \ref{Table:weights}.
  • ...and 11 more figures
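
The pipeline described in the Figure 1 caption (normalize metrics, aggregate them into trust dimension indices, re-weight the indices, rank the datasets) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy metric values, and the weight profiles are hypothetical, and a plain mean stands in for the copula-based aggregation that the paper uses. Only the empirical CDF normalization step is taken from the captions above.

```python
import numpy as np

def ecdf_normalize(values):
    """Map raw metric values to (0, 1] with the empirical CDF.

    Each value is replaced by the fraction of observations <= that value.
    """
    values = np.asarray(values, dtype=float)
    return np.array([(values <= v).mean() for v in values])

def dimension_index(metric_matrix):
    """Aggregate several metrics into one index per dataset.

    metric_matrix has shape (n_datasets, n_metrics), higher is better.
    ASSUMPTION: a simple mean replaces the paper's copula-based aggregation.
    """
    metric_matrix = np.asarray(metric_matrix, dtype=float)
    normalized = np.column_stack([ecdf_normalize(col) for col in metric_matrix.T])
    return normalized.mean(axis=1)

def trust_index(dimension_indices, weights):
    """Weighted average of trust dimension indices (weights are normalized)."""
    total = float(sum(weights.values()))
    return sum(weights[d] / total * dimension_indices[d] for d in weights)

def rank_datasets(names, scores):
    """Return dataset names sorted from highest to lowest score."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [names[i] for i in order]

# Toy, made-up metric values for three hypothetical synthetic datasets.
names = ["A", "B", "C"]
indices = {
    "fidelity": dimension_index([[0.9, 0.8], [0.5, 0.6], [0.7, 0.7]]),
    "privacy": dimension_index([[0.2], [0.9], [0.6]]),
}

# Two hypothetical trust profiles: one favors privacy, one favors fidelity.
privacy_first = trust_index(indices, {"fidelity": 1.0, "privacy": 3.0})
fidelity_first = trust_index(indices, {"fidelity": 3.0, "privacy": 1.0})

print(rank_datasets(names, privacy_first))
print(rank_datasets(names, fidelity_first))
```

Changing the weight profile changes the ranking without recomputing any metric, which is the sense in which the trade-offs are controllable in this sketch.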