Private Prediction via Shrinkage

Chao Yan

TL;DR

This paper advances private prediction in streaming settings by reducing the dependence on the number of queries $T$ from the standard $\sqrt{T}$ to polylogarithmic in $T$. Building on the frameworks of Dwork and Feldman and of Naor et al. (NNSY), it combines subsample-aggregate and sparse-vector techniques with a shrinkage strategy that bounds the number of hard queries, enabling private labeling of exponentially many queries against oblivious online adversaries. For adaptive online adversaries and halfspaces in $\mathbb{R}^d$, it uses a geometric reduction to linear feasibility via $\mathrm{cdepth}$, showing that after at most $d+1$ constraint halvings the remaining hypotheses agree on all future queries, and achieving a sample complexity of $\tilde{O}(d^{5.5}\log T)$. Overall, the results establish that super-polynomial query streams can be answered privately with polylogarithmic dependence on $T$ under standard adversary models, with concrete bounds tied to the VC dimension and the ambient dimension.

Abstract

We study differentially private prediction introduced by Dwork and Feldman (COLT 2018): an algorithm receives one labeled sample set $S$ and then answers a stream of unlabeled queries while the output transcript remains $(\varepsilon,\delta)$-differentially private with respect to $S$. Standard composition yields a $\sqrt{T}$ dependence for $T$ queries. We show that this dependence can be reduced to polylogarithmic in $T$ in streaming settings. For an oblivious online adversary and any concept class $\mathcal{C}$, we give a private predictor that answers $T$ queries with $|S|= \tilde{O}(VC(\mathcal{C})^{3.5}\log^{3.5}T)$ labeled examples. For an adaptive online adversary and halfspaces over $\mathbb{R}^d$, we obtain $|S|=\tilde{O}\left(d^{5.5}\log T\right)$.
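
To give a rough sense of the gap between the two dependences on $T$, the following sketch compares $\sqrt{T}$ with $\log^{3.5} T$ for a few stream lengths. It is only an illustration: all constants, the VC-dimension factor, and the lower-order terms hidden by $\tilde{O}$ are ignored, and the base-2 logarithm and the function names are assumptions made here, not taken from the paper.

```python
import math

def sqrt_dependence(T: int) -> float:
    """Growth in T under standard composition, constants ignored."""
    return math.sqrt(T)

def polylog_dependence(T: int, power: float = 3.5) -> float:
    """Growth in T for the oblivious-adversary bound, constants ignored.
    The base-2 logarithm is an arbitrary choice for this illustration."""
    return math.log2(T) ** power

# Compare the two growth rates for stream lengths T = 2^k.
for k in (10, 20, 40, 80):
    T = 2 ** k
    print(f"T = 2^{k}: sqrt(T) = {sqrt_dependence(T):.3g}, "
          f"log^3.5(T) = {polylog_dependence(T):.3g}")
```

For short streams the polylogarithmic term can be the larger of the two, so the improvement matters in the regime where $T$ is very large, for example super-polynomial in the sample size.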

Paper Structure

This paper contains 23 sections, 13 theorems, 17 equations, 3 figures, 1 table, and 4 algorithms.

Key Result

Theorem 1

Let $(X,R)$ have VC dimension $d$, let $S\subseteq X$, and let $0<\alpha,\beta\leq 1$. Let $S'\subseteq S$ be a random subset of $S$ of size at least $O\left(\frac{d\cdot\log\frac{d}{\alpha}+\log\frac{1}{\beta}}{\alpha^2}\right)$. Then with probability at least $1-\beta$, $S'$ is an $\alpha$-approximation for $S$.
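
The statement can be checked empirically on a simple range space. The sketch below assumes the standard notion of $\alpha$-approximation (for every range, the fraction of $S'$ in the range is within $\alpha$ of the fraction of $S$ in the range) and uses one-dimensional thresholds $\{x : x \le t\}$, a class of VC dimension 1. The sizes, the seed, and the helper name `max_deviation` are illustrative choices, not taken from the paper.

```python
import random

def max_deviation(S, S_sub):
    """Largest gap, over all threshold ranges {x : x <= t}, between the
    fraction of S and the fraction of S_sub that fall in the range.
    Assumes the points of S are distinct."""
    S_sorted = sorted(S)
    sub_sorted = sorted(S_sub)
    worst = 0.0
    j = 0  # number of subsample points <= current threshold
    # It suffices to test thresholds at points of S, since the counts
    # only change there.
    for i, t in enumerate(S_sorted, start=1):
        while j < len(sub_sorted) and sub_sorted[j] <= t:
            j += 1
        worst = max(worst, abs(i / len(S_sorted) - j / len(sub_sorted)))
    return worst

rng = random.Random(0)
S = list(range(10_000))
S_sub = rng.sample(S, 2_000)   # random subset S' of S
alpha = 0.1

deviation = max_deviation(S, S_sub)
print(f"max deviation = {deviation:.4f}, alpha-approximation: {deviation <= alpha}")
```

With 2,000 samples and $\alpha = 0.1$, the subsample is far larger than the bound requires for this class, so the check is expected to pass with overwhelming probability.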

Figures (3)

  • Figure 1: In the left figure, we illustrate that when a “hard” query $x$ occurs, the current hypothesis set splits into two subspaces, $\mathcal{C}|_{h(x)=1}$ and $\mathcal{C}|_{h(x)=-1}$. We then guess a label uniformly at random (say, 1) and update the hypothesis set by restricting to $\mathcal{C}|_{h(x)=1}$, as shown in the right figure.
  • Figure 2: After answering $O(VC(\mathcal{C})\log T)$ “hard” queries, the remaining hypothesis space collapses to a set of hypotheses that induce the same labeling on the entire query sequence $x_1,\dots, x_T$ (including queries that have not yet appeared in the stream).
  • Figure 3: In the left figure, we illustrate that when a “hard” query occurs, the current hypotheses are roughly split across the two sides of the induced hyperplane. We then update the feasible set by restricting all hypotheses to lie on this hyperplane for subsequent rounds, while preserving the existence of a hypothesis (point) with high $\mathrm{cdepth}$, as shown in the right figure.
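
The shrinkage step described in Figures 1 and 2 can be sketched on a finite hypothesis class. The code below is a non-private toy version only: it omits the subsample-aggregate and sparse-vector machinery entirely, and the definition of a "hard" query used here (neither label is predicted by at least a $1 - \texttt{margin}$ fraction of the surviving hypotheses), the `margin` value, and all function names are assumptions made for illustration.

```python
import random

def answer_query(hypotheses, x, rng, margin=0.25):
    """Answer one query and possibly shrink the hypothesis set.

    Returns (label, surviving_hypotheses, was_hard)."""
    positive = [h for h in hypotheses if h(x) == 1]
    negative = [h for h in hypotheses if h(x) == -1]
    frac_pos = len(positive) / len(hypotheses)
    if frac_pos >= 1 - margin:      # easy query: near-unanimous +1
        return 1, hypotheses, False
    if frac_pos <= margin:          # easy query: near-unanimous -1
        return -1, hypotheses, False
    # Hard query: guess a label uniformly at random and keep only the
    # hypotheses that agree with the guess.
    label = rng.choice([1, -1])
    survivors = positive if label == 1 else negative
    return label, survivors, True

def run_stream(hypotheses, queries, seed=0):
    """Process a query stream; return labels, hard-query count, survivors."""
    rng = random.Random(seed)
    labels, hard = [], 0
    for x in queries:
        label, hypotheses, was_hard = answer_query(hypotheses, x, rng)
        labels.append(label)
        hard += was_hard
    return labels, hard, hypotheses

# Toy class: 101 threshold functions h_t(x) = +1 if x >= t else -1.
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(101)]
query_rng = random.Random(1)
queries = [query_rng.randrange(100) for _ in range(1000)]

labels, hard_count, survivors = run_stream(thresholds, queries)
print(f"hard queries: {hard_count} of {len(queries)}, "
      f"surviving hypotheses: {len(survivors)}")
```

In this toy version each hard query removes more than a `margin` fraction of the surviving hypotheses, so the number of hard queries is at most logarithmic in the size of the class regardless of the stream length. This mirrors, in a much simpler setting, the role that the bound on hard queries plays in Figure 2.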

Theorems & Definitions (45)

  • Definition 1
  • Definition 2: Vapnik-Chervonenkis dimension [VC; Haussler1986EpsilonnetsAS]
  • Definition 3: $\alpha$-approximation [VC; Haussler1986EpsilonnetsAS]
  • Theorem 1 [VC; Haussler1986EpsilonnetsAS]
  • Definition 4: Generalization and empirical error
  • Theorem 2 [BlumerEhHaWa89; kaplan2020private]
  • Definition 5: Differential Privacy [DMNS06]
  • Theorem 3: Advanced composition [DRV10]
  • Lemma 1: Privacy for BetweenThresholds [BunSU16]
  • Lemma 2: Accuracy for BetweenThresholds [BunSU16]
  • ...and 35 more