Attesting Distributional Properties of Training Data for Machine Learning

Vasisht Duddu, Anudeep Das, Nora Khayata, Hossein Yalame, Thomas Schneider, N. Asokan

TL;DR

This work tackles regulatory pressure to verify distributional properties of ML training data without disclosing sensitive data by introducing ML property attestation. It compares three mechanisms—inference-based, cryptographic MPC-based, and a hybrid approach—to attest properties like population representativeness $p_{req}$. Empirical results show that inference-based attestation lacks universal effectiveness, cryptographic attestation is robust but expensive, and the hybrid scheme offers a practical balance with reduced cost and improved robustness. The study provides a framework for privacy-preserving, regulator-friendly verification of training-data properties and discusses deployment trade-offs and future enhancements for real-world adoption, including secure outsourcing and adversarial robustness.

Abstract

The success of machine learning (ML) has been accompanied by increased concerns about its trustworthiness. Several jurisdictions are preparing ML regulatory frameworks. One such concern is ensuring that model training data has desirable distributional properties for certain sensitive attributes. For example, draft regulations indicate that model trainers are required to show that training datasets have specific distributional properties, such as reflecting diversity of the population. We propose the notion of property attestation allowing a prover (e.g., model trainer) to demonstrate relevant distributional properties of training data to a verifier (e.g., a customer) without revealing the data. We present an effective hybrid property attestation combining property inference with cryptographic mechanisms.

Paper Structure (18 sections, 14 figures, 6 tables)

This paper contains 18 sections, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Inference-based Attestation: During preparation, $\mathcal{V}$ trains $f_{att}$ using the first-layer parameters of models trained on the training data $\mathcal{D}^{tr}_{\mathcal{P}}$ with $p_{req}$ ($\{\mathcal{M}^i_{p_{req}}\}_{i=1}^{\mathcal{N}_{m}}$) and $!p_{req}$ ($\{\mathcal{M}^i_{!p_{req}}\}_{i=1}^{\mathcal{N}_{m}}$). During attestation, $\mathcal{V}$ uses the first-layer parameters of $\mathcal{M}_{p}$ to attest whether it was indeed trained on $\mathcal{D}^{tr}_{\mathcal{P}}$ with $p_{req}$.
  • Figure 2: Cryptographic Attestation: $\mathcal{P}$ sends secret shares of the training data $\mathcal{D}^{tr}_{\mathcal{P}}$ to $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$. The servers securely compute "DistCheck" on $\mathcal{D}^{tr}_{\mathcal{P}}$ and train $\mathcal{M}_{2pc}$ on $\mathcal{D}^{tr}_{\mathcal{P}}$ from their secret shares using 2PC. The output shares are then sent to $\mathcal{V}$, which reconstructs the outputs.
  • Figure 3: ARXIV
  • Figure 4: BONEAGE
  • Figure 5: CENSUS-R
  • ...and 9 more figures
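
The data flow in the Figure 2 caption (prover shares data with two servers, servers compute on shares, verifier reconstructs the output) can be illustrated with a minimal sketch. It is not the paper's protocol. It assumes plain two-party additive secret sharing over a 64-bit ring, shares only a per-record indicator bit for the sensitive attribute, and has the verifier reconstruct the count itself. A real "DistCheck" would be evaluated inside 2PC so that only the pass/fail result is revealed, and model training is omitted entirely. All function names, the modulus, and the ratio bounds are hypothetical.

```python
import secrets

# Ring size for additive shares (an assumption of this sketch).
MODULUS = 2**64


def share(value, modulus=MODULUS):
    """Split value into two additive shares that sum to value mod modulus."""
    s1 = secrets.randbelow(modulus)
    s2 = (value - s1) % modulus
    return s1, s2


def reconstruct(s1, s2, modulus=MODULUS):
    """Recombine two additive shares into the original value."""
    return (s1 + s2) % modulus


def prover_share_attribute(bits):
    """Prover P: share each record's sensitive-attribute bit (0 or 1)."""
    pairs = [share(b) for b in bits]
    to_s1 = [p[0] for p in pairs]  # sent to server S1
    to_s2 = [p[1] for p in pairs]  # sent to server S2
    return to_s1, to_s2


def server_local_sum(shares, modulus=MODULUS):
    """Server S1 or S2: add its shares locally; addition needs no interaction."""
    return sum(shares) % modulus


def verifier_check(sum1, sum2, n_records, lo, hi):
    """Verifier V: reconstruct the count and test the ratio against [lo, hi].

    Simplification: this reveals the count to V. A 2PC comparison would
    reveal only whether the check passed.
    """
    count = reconstruct(sum1, sum2)
    ratio = count / n_records
    return lo <= ratio <= hi, ratio


# Toy dataset: 5 of 10 records carry the sensitive attribute.
bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
to_s1, to_s2 = prover_share_attribute(bits)
ok, ratio = verifier_check(
    server_local_sum(to_s1), server_local_sum(to_s2), len(bits), lo=0.4, hi=0.6
)
print(ok, ratio)
```

Each server sees only uniformly random ring elements, so neither learns any individual bit on its own. The sum works on shares because additive sharing is linear. The range comparison is the step that would need an interactive secure-comparison protocol in an actual 2PC deployment.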