Protecting Confidentiality, Privacy and Integrity in Collaborative Learning

Dong Chen, Alice Dethise, Istemi Ekin Akkus, Ivica Rimac, Klaus Satzke, Antti Koskela, Marco Canini, Wei Wang, Ruichuan Chen

TL;DR

Citadel++ tackles the challenge of confidential collaboration between dataset and model owners by combining VM-level TEEs, an enhanced DP-SGD-based privacy barrier, and OS-level sandboxing to protect datasets, models, and training code while preserving user privacy and performance. The system introduces differentially private masking, dynamic gradient clipping, and DP noise correction to provide strong privacy guarantees even under collusion, and augments TEEs with sandboxing and integrity mechanisms (including CVM initialization/runtime attestation and container-image integrity) to prevent leaks from untrusted training code. Formal DP analyses tie the privacy barrier to DP accounting via privacy loss random variables, and the implementation demonstrates practical performance with negligible sandboxing overhead and large speedups on GPU-TEE hardware compared to prior privacy-preserving systems. Overall, Citadel++ enables scalable, confidential, and private collaborative learning without requiring inspection of model code, while delivering model utility comparable to central DP-SGD baselines and outperforming state-of-the-art privacy-preserving systems by up to 543x on CPU and 113x on GPU.

Abstract

Effective machine learning (ML) training requires collaboration between dataset owners and model owners. During this collaboration, however, dataset owners and model owners want to protect the confidentiality of their respective assets (i.e., datasets, models and training code), with the dataset owners also caring about the privacy of individual users whose data is in their datasets. Existing solutions either provide limited confidentiality for models and training code, or suffer from privacy issues due to collusion. We present Citadel++, a collaborative ML training system designed to simultaneously protect the confidentiality of datasets, models and training code as well as the privacy of individual users. Citadel++ enhances differential privacy mechanisms to safeguard the privacy of individual user data while maintaining model utility. By combining Virtual Machine-level Trusted Execution Environments (TEEs) with improved sandboxing and integrity mechanisms built on OS-level techniques, Citadel++ effectively preserves the confidentiality of datasets, models and training code, and enforces our privacy mechanisms even when the models and training code have been maliciously designed. Our experiments show that Citadel++ provides model utility and performance while adhering to the confidentiality and privacy requirements of dataset owners and model owners, outperforming the state-of-the-art privacy-preserving training systems by up to 543x on CPU and 113x on GPU TEEs.
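
For readers unfamiliar with the DP-SGD building block that the privacy mechanisms are described as enhancing, the following is a minimal sketch of one textbook DP-SGD update (per-sample gradient clipping followed by Gaussian noise). It is not the Citadel++ implementation and does not include its masking, dynamic clipping, or noise correction; the function names, the NumPy representation of gradients, and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def clip_per_sample(per_sample_grads, clip_norm):
    """Scale each row (one sample's gradient) so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Rows already within the bound keep scale 1; the floor avoids division by zero.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return per_sample_grads * scale

def dp_sgd_step(params, per_sample_grads, clip_norm, noise_multiplier, lr, rng):
    """One plain DP-SGD step (hypothetical helper, not the paper's algorithm)."""
    clipped = clip_per_sample(per_sample_grads, clip_norm)
    # Gaussian noise with standard deviation noise_multiplier * clip_norm,
    # added once to the summed clipped gradients.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / per_sample_grads.shape[0]
    return params - lr * noisy_mean

# Illustrative usage with made-up values.
rng = np.random.default_rng(0)
params = np.zeros(2)
grads = np.array([[3.0, 4.0], [0.0, 0.0]])  # first row has L2 norm 5
new_params = dp_sgd_step(params, grads, clip_norm=1.0,
                         noise_multiplier=1.1, lr=0.1, rng=rng)
```

Clipping bounds the contribution of any single sample, which is what lets the added noise be calibrated to `clip_norm`; the privacy accounting itself is outside the scope of this sketch.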

Paper Structure

This paper contains 41 sections, 5 theorems, 27 equations, 13 figures.

Key Result

Theorem 1

We have:

Figures (13)

  • Figure 1: High-level overview of Citadel++.
  • Figure 2: Differential-private masking.
  • Figure 3: Citadel++ service code stack.
  • Figure 4: Citadel++ service integrity mechanisms.
  • Figure 5: Model accuracy and convergence over time, compared with the non-private baselines.
  • ...and 8 more figures

Theorems & Definitions (7)

  • Definition 1: zhu2022optimal
  • Definition 2
  • Theorem 1: gopi2021
  • Theorem 2: gopi2021, zhu2022optimal
  • Theorem 3
  • Theorem 4
  • Theorem 5