I Prefer not to Say: Protecting User Consent in Models with Optional Personal Data

Tobias Leemann, Martin Pawelczyk, Christian Thomas Eberle, Gjergji Kasneci

TL;DR

The paper tackles privacy concerns in ML systems where users can choose whether to share optional data by formalizing Availability Inference Restriction (AIR) and Protected User Consent (PUC). It presents a model-agnostic data augmentation approach, PUCIDA, to learn PUC-compliant predictors that minimize loss $\mathcal{L}$ under AIR, and proves both predictive non-degradation and finite-sample convergence guarantees. The work provides a formal, scalable framework for balancing privacy, accuracy, and regulatory compliance, including a multi-feature generalization (r-dimensional PUC) and strategic-withholding considerations. Empirically, PUCIDA removes the disadvantage faced by non-sharers while enabling the decision-maker to leverage consenting users’ data with only moderate performance trade-offs, as demonstrated on eight real datasets and synthetic benchmarks.

Abstract

We examine machine learning models in a setup where individuals have the choice to share optional personal information with a decision-making system, as seen in modern insurance pricing models. Some users consent to their data being used whereas others object and keep their data undisclosed. In this work, we show that the decision not to share data can be considered as information in itself that should be protected to respect users' privacy. This observation raises the overlooked problem of how to ensure that users who protect their personal data do not suffer any disadvantages as a result. To address this problem, we formalize protection requirements for models which only use the information for which active user consent was obtained. This excludes implicit information contained in the decision to share data or not. We offer the first solution to this problem by proposing the notion of Protected User Consent (PUC), which we prove to be loss-optimal under our protection requirement. We observe that privacy and performance are not fundamentally at odds with each other and that it is possible for a decision maker to benefit from additional data while respecting users' consent. To learn PUC-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we analyze the implications of PUC on challenging real datasets, tasks, and models.

Paper Structure

This paper contains 45 sections, 8 theorems, 52 equations, 11 figures, 19 tables, 1 algorithm.

Key Result

Theorem 1

Let $f:\mathcal{X} \rightarrow \mathcal{Y} \subseteq \mathbb{R}$ be a full feature model (i.e., including optional features). Among all predictors compatible with the Availability Inference Restriction, a model $f$ with minimal loss is given by:

Figures (11)

  • Figure 1: Overview of the relevant stakeholders. We consider a case where users can voluntarily provide information on optional features or choose to leave them undisclosed. The goals of sharers, non-sharers, and the decision maker have to be reconciled.
  • Figure 2: Samples for the insurance use-case. We have two base features $\mathbf{b}$ and one optional feature $z^*$, which either takes an observed value $z$, or it takes a value of N/A if unobserved. The variable $a \in \left\{0,1\right\}$ indicates the availability of the feature. The goal is to predict the label $y$.
  • Figure 3: Explaining PUCIDA. Our data augmentation procedure expands each instance with optional information into two samples: the original instance and a synthetic sample (+). The synthetic samples retain the base features and the labels, but the information on the optional features is dropped (fitness score $\rightarrow$ N/A). For inputs with a missing optional value, the model therefore sees samples from both sharers and non-sharers with the same base features and will base its decision only on the base features. In this example, given the base features ("NSW", basic) and no optional statements, the model would estimate the costs to be \$24k, which is the dataset average conditioned on these values.
  • Figure 4: PUCIDA is model-agnostic. The PUC-gaps are close to zero when applying our technique across a variety of common models on the simulated dataset.
  • Figure 5: Convergence rate of models under PUCIDA. The estimate of PUC converges to the true value at a rate of $\mathcal{O}(\frac{1}{N})$ for the baseline estimator $\hat{\mu}$ and other commonly used models.
  • ...and 6 more figures
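
The augmentation step described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pucida_augment` and `mean_label_by_base`, the tuple layout `(base, optional, label)`, and all data values are hypothetical and chosen only so that the "NSW"/basic group averages to 24, mirroring the caption. The sketch checks that, after augmentation, the average label among N/A samples equals the dataset average conditioned on the base features.

```python
from collections import defaultdict
from statistics import mean

NA = None  # hypothetical marker for an undisclosed optional feature


def pucida_augment(samples):
    """Return the original samples plus one synthetic copy per sharer.

    Each sample is a (base, optional, label) tuple. The synthetic copy keeps
    the base features and the label but drops the optional value (-> N/A).
    """
    augmented = list(samples)
    for base, optional, label in samples:
        if optional is not NA:
            augmented.append((base, NA, label))
    return augmented


def mean_label_by_base(samples, only_na=False):
    """Average label per base-feature group, optionally over N/A rows only."""
    groups = defaultdict(list)
    for base, optional, label in samples:
        if only_na and optional is not NA:
            continue
        groups[base].append(label)
    return {base: mean(labels) for base, labels in groups.items()}


# Made-up toy data: (base features, optional fitness score, cost label).
data = [
    (("NSW", "basic"), 0.9, 12.0),
    (("NSW", "basic"), 0.4, 30.0),
    (("NSW", "basic"), NA, 30.0),
    (("VIC", "premium"), 0.7, 18.0),
    (("VIC", "premium"), NA, 40.0),
]

augmented = pucida_augment(data)
na_means = mean_label_by_base(augmented, only_na=True)
overall_means = mean_label_by_base(data)

print(na_means)
print(overall_means)
```

In this toy setup, a model that fits the augmented data well on N/A inputs would be pushed toward the per-group averages in `overall_means`, so a non-sharer is scored from the base features alone rather than from the fact that they withheld data.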

Theorems & Definitions (10)

  • Definition 1: Availability Inference Restriction
  • Theorem 1: 1D-PUC
  • Corollary 1: Predictive Non-Degradation of $f^{*}_{\text{PUC}}$
  • Theorem 2: Optimality of $f^{*}_{\text{PUC}}$ under strategic actions
  • Definition 2: Protected User Consent, PUC
  • Theorem 3
  • Theorem 4: Finite Sample Convergence
  • Lemma 1: CSP-compliant models can degrade model performance over base feature model
  • Lemma 2
  • Theorem 5: Convergence of Finite Sample Approximation