Inducing Uncertainty on Open-Weight Models for Test-Time Privacy in Image Recognition

Muhammad H. Ashiq, Peter Triantafillou, Hung Yun Tseng, Grigoris G. Chrysos

TL;DR

This work introduces test-time privacy for open-weight image classifiers by defending against adversaries who leverage confident predictions on corrupted data. It formulates a Pareto-optimal fine-tuning approach that makes forget-set predictions uniform while preserving performance on retained data, enabled by a uniform learner and two algorithms: the Exact Pareto Learner and a certified variant. The authors derive a tight theoretical privacy-utility bound and provide a certified, Hessian-based framework for practical deployment, including an unbiased Hessian estimator and online extensions. Empirically, the method yields over 3× reductions in forget-set confidence with minimal drops in retain/test accuracy across multiple datasets and architectures, outperforming several baselines and offering reproducible results.

Abstract

A key concern for AI safety remains understudied in the machine learning (ML) literature: how can we ensure users of ML models do not leverage predictions on incorrect personal data to harm others? This is particularly pertinent given the rise of open-weight models, where simply masking model outputs does not suffice to prevent adversaries from recovering harmful predictions. To address this threat, which we call *test-time privacy*, we induce maximal uncertainty on protected instances while preserving accuracy on all other instances. Our proposed algorithm uses a Pareto optimal objective that explicitly balances test-time privacy against utility. We also provide a certifiable approximation algorithm which achieves $(\varepsilon, \delta)$ guarantees without convexity assumptions. We then prove a tight bound that characterizes the privacy-utility tradeoff that our algorithms incur. Empirically, our method obtains more than $3\times$ stronger uncertainty than pretraining with marginal drops in accuracy on various image recognition benchmarks. Altogether, this framework provides a tool to guarantee additional protection to end users.
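To make the idea of balancing uniformity on protected instances against accuracy elsewhere more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes the objective is a $\theta$-weighted sum of a KL-to-uniform term on the forget set and a cross-entropy term on the retain set, and it assumes a simple "confidence gap" (maximum softmax probability minus $1/K$) as a stand-in for an uncertainty metric. All function names (`kl_to_uniform`, `pareto_objective`, `confidence_gap`) are hypothetical. The default `theta=0.75` is the value the Figure 2 caption reports for the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_to_uniform(probs):
    """KL(p || uniform) = sum_i p_i * log(p_i * K); zero iff p is uniform.

    Assumption: the paper's uniformity loss may use a different divergence
    or the opposite KL direction.
    """
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Standard cross-entropy for one example, clipped away from log(0)."""
    return -math.log(max(probs[label], 1e-12))


def pareto_objective(forget_logits, retain_logits, retain_labels, theta=0.75):
    """Assumed scalarized objective: theta * uniformity + (1 - theta) * utility."""
    uniform_term = sum(kl_to_uniform(softmax(z)) for z in forget_logits)
    uniform_term /= len(forget_logits)
    utility_term = sum(
        cross_entropy(softmax(z), y) for z, y in zip(retain_logits, retain_labels)
    )
    utility_term /= len(retain_logits)
    return theta * uniform_term + (1.0 - theta) * utility_term


def confidence_gap(logits):
    """Hypothetical metric: max softmax probability minus 1/K (0 when uniform)."""
    probs = softmax(logits)
    return max(probs) - 1.0 / len(probs)
```

Under these assumptions, a larger `theta` puts more weight on making forget-set predictions uniform, and a smaller `theta` puts more weight on retain-set accuracy. A model that outputs identical logits on a protected instance has a confidence gap of zero.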

Paper Structure

This paper contains 62 sections, 21 theorems, 92 equations, 10 figures, 22 tables, 7 algorithms.

Key Result

Proposition 1

Suppose we have a hypothesis space $\mathcal{H}_{\mathcal{W}}$ consisting of functions where the ultimate layer is an affine transformation and the outputs are passed through a softmax. Let $\mathcal{K}$ be a uniform learner. Then, $f_{\mathcal{K}(\mathcal{D})} \in \mathcal{H}_{\mathcal{W}} \; \forall \ldots$

Figures (10)

  • Figure 1: Across datasets, we observe a significant drop in confidence distance, where lower is better, for both our algorithms. Both algorithms also retain strong accuracy on the retain set. We observe similar behavior for the test set in \ref{appendix:additional_experiments}, while the baselines are inconsistent. Variance is negligible for all metrics.
  • Figure 2: From \ref{fig:ret_pareto}, we observe that for simple datasets, the retain accuracy decreases smoothly. However, for larger datasets like CIFAR10 and CIFAR100, retain accuracy drops significantly once $\theta$ passes approximately $0.75$. This motivates our choice of $\theta = 0.75$ used throughout our experiments. In \ref{fig:conf_pareto} we observe that the confidence distance decreases roughly linearly as $\theta$ increases.
  • Figure K.3: Accuracy on test set for baselines as well as \ref{algo:finetuning_algo} and \ref{algo:hess_exact_algo} with $\theta = 0.75$.
  • Figure K.4: For CIFAR10 and CIFAR100 ResNet50, we observe a sharp drop in confidence distance followed by a sharp increase in \ref{fig:conf_vs_epoch}, in line with the drops and increases for retain accuracy in \ref{fig:ret_vs_epoch}. Test accuracy is similar. This highlights the need for early stopping when using \ref{algo:finetuning_algo} for large models, since otherwise one escapes from a good privacy-utility tradeoff. For smaller models, e.g. MNIST MLP, this issue does not persist: we obtain good uniformity after an initial drop in accuracy, and accuracy then increases while confidence distance decreases.
  • Figure K.5: Test Accuracy vs. Epochs, $\theta = 0.75$, MNIST. This has similar behavior to \ref{fig:ret_vs_epoch}.
  • ...and 5 more figures

Theorems & Definitions (45)

  • Definition 1: Uniform learner
  • Proposition 1
  • Proposition 2
  • Definition 2: $(\varepsilon, \delta)$-differential privacy
  • Definition 3: $(\varepsilon, \delta)$-certified unlearning
  • Definition 4: $(\varepsilon,\delta,\theta)$-certified Pareto learner
  • Proposition 3
  • Corollary 1
  • Theorem 4.1
  • Theorem F.1
  • ...and 35 more