The Privacy-Utility Trade-off in the Topics API

Mário S. Alvim; Natasha Fernandes; Annabelle McIver; Gabriel H. Nunes

TL;DR

The paper addresses the privacy-utility trade-off of Google's Topics API, positioned as an alternative to third-party cookies in privacy-preserving advertising. It builds a formal model using Quantitative Information Flow to quantify privacy leakage and advertising utility, deriving average- and max-case bounds that account for unknown correlations and the differential privacy parameter $\epsilon$. The authors provide novel theoretical results and validate them with real-world AOL-derived datasets, showing that generalization and bounded noise substantially reduce leakage, while DP adds plausible deniability; however, max-case capacities can remain large for bigger taxonomies. The work yields practical guidance on how taxonomy size, top-$s$ set size, and the DP parameter influence privacy risk and IBA utility, and provides datasets and code to evaluate future API updates and taxonomy choices.

Abstract

The ongoing deprecation of third-party cookies by web browser vendors has sparked the proposal of alternative methods to support more privacy-preserving personalized advertising on web browsers and applications. The Topics API is being proposed by Google to provide third parties with "coarse-grained advertising topics that the page visitor might currently be interested in". In this paper, we analyze the re-identification risks for individual Internet users and the utility provided to advertising companies by the Topics API, i.e. learning the most popular topics and distinguishing between real and random topics. We provide theoretical results, dependent only on the API parameters, that can be readily applied to evaluate the privacy and utility implications of future API updates. These include novel general upper bounds that account for adversaries with access to unknown, arbitrary side information and for the value of the differential privacy parameter $\epsilon$. We also report experimental results on real-world data that validate our theoretical model.

Paper Structure (69 sections, 6 theorems, 42 equations, 6 figures, 8 tables, 3 algorithms)


Introduction
Objectives
Contributions
Plan of the paper.
Technical Background
Third-party Cookies
Topics API
Quantitative Information Flow
Prior Vulnerability
Information leakage from a channel
Posterior Vulnerability
Multiplicative Leakage
Cascading
Internal Fixed-Probability Choice
Bayes vulnerability
...and 54 more sections

Key Result

Theorem 4

For channel matrices $\mathsf{C} : \mathcal{X} \to \mathcal{Y}$ and $\mathsf{D} : \mathcal{X'} \to \mathcal{Y'}$, the multiplicative (average-case) Bayes capacity of the Kronecker product $\mathsf{C} \otimes \mathsf{D} : (\mathcal{X}, \mathcal{X'}) \to (\mathcal{Y}, \mathcal{Y'})$ is $$\mathcal{ML}_1^{\times}(\mathbb{D}, \mathsf{C} \otimes \mathsf{D}) = \mathcal{ML}_1^{\times}(\mathbb{D}, \mathsf{C}) \times \mathcal{ML}_1^{\times}(\mathbb{D}, \mathsf{D}),$$ i.e. the product of the multiplicative Bayes capacities of $\mathsf{C}$ and $\mathsf{D}$.
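Theorem 4 can be checked numerically on small examples. The sketch below relies on the standard Quantitative Information Flow fact that the multiplicative Bayes capacity of a channel matrix equals the sum of its column maxima. The matrices `C` and `D` are arbitrary toy channels chosen for illustration, not taken from the paper, and the helper names are hypothetical.

```python
from fractions import Fraction as F

def bayes_capacity(channel):
    """Multiplicative Bayes capacity: sum of the column maxima of a row-stochastic matrix."""
    return sum(max(col) for col in zip(*channel))

def kronecker(c, d):
    """Kronecker product: entry ((x, x'), (y, y')) equals C[x][y] * D[x'][y']."""
    return [[c_xy * d_xy for c_xy in c_row for d_xy in d_row]
            for c_row in c for d_row in d]

# Toy channels (assumed for illustration); each row sums to 1.
C = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 4), F(1, 2)]]
D = [[F(3, 4), F(1, 4)],
     [F(1, 3), F(2, 3)]]

cap_c = bayes_capacity(C)                  # 1/2 + 1/2 + 1/2 = 3/2
cap_d = bayes_capacity(D)                  # 3/4 + 2/3 = 17/12
cap_cd = bayes_capacity(kronecker(C, D))   # capacity of the 4 x 6 product channel

# The capacity of the Kronecker product is the product of the capacities.
assert cap_cd == cap_c * cap_d
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equality holds without floating-point tolerance.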

Figures (6)

Figure 1: Information collected by an interest-based advertising (IBA) company.
Figure 2: Pipeline for third-party cookies according to Alg. \ref{alg:cookies}. The channel $\mathsf{C}_{\mathit{BH}}$ maps Internet users to their respective Browsing Histories. The channel $\mathsf{C}_{U}$ maps browsing histories to Unique identifiers uid$_i$ defined by third-party cookies. The final channel for utility analyses is $\mathsf{C}_{U}$ above and the final channel for privacy analyses is $\mathsf{C}_{\mathcal{C}}$ in Fig. 4.
Figure 3: Pipeline for the Topics API according to Alg. \ref{alg:topics-derive} and Alg. \ref{alg:topics-report}. The channel $\mathsf{C}_{\mathit{BH}}$ maps Internet users to their respective Browsing Histories. The channel $\mathsf{C}_{G}$ maps browsing histories to sets of top-$(s=2)$ topics, a Generalization step. The channel $\mathsf{C}_{\mathit{BN} \oplus_{0.05} \mathit{DP}}$ maps sets of top-$(s=2)$ topics to individual topics; this channel is the result of an internal fixed-probability choice between channels $\mathsf{C}_{\mathit{BN}}$ and $\mathsf{C}_{\mathit{DP}}$. The channel $\mathsf{C}_{\mathit{BN}}$ is the case in which the API reports a topic from the sets of top-$(s=2)$ topics with uniform probability, i.e. each with $1/s$ probability, a Bounded Noise step that happens with $(1-r)=95\%$ chance. The channel $\mathsf{C}_{\mathit{DP}}$ is the case in which the API reports a random topic from the whole taxonomy with uniform probability, i.e. each with $1/m$ probability, a Differential Privacy step that happens with $r=5\%$ chance. The final channel for utility analyses is $\mathsf{C}_{\mathit{BN} \oplus_{0.05} \mathit{DP}}$ above and the final channel for privacy analyses is $\mathsf{C}_{\mathcal{T}}$ in Fig. 4.
Figure 4: Final channels for privacy analyses. The channel $\mathsf{C}_{\mathcal{C}}$ is the cascading of channels $\mathsf{C}_{\mathit{BH}} \mathsf{C}_{U}$ and the channel $\mathsf{C}_{\mathcal{T}}$ is the cascading of channels $\mathsf{C}_{\mathit{BH}} \mathsf{C}_{G} \mathsf{C}_{\mathit{BN} \oplus_{0.05} \mathit{DP}}$.
Figure 5: Expected value for $A^{n}$ with respect to the binomial distribution (cf. \ref{eq:counting-binomial}), where $A = \mathtt{p}/\mathtt{q}$, $\mathtt{p} = (1-r)/s + r/m$, and $\mathtt{q} = r/m$, according to channel $\mathsf{C}_{\mathit{BN} \oplus_{r} \mathit{DP}}$ (cf. \ref{eq:model-topics-utility-channel}), for $2 \leq N \leq 30$, $s=5$, $m=349$ or $m=629$, and $r=0.05$. As the population size of Internet users reporting topics increases, the probability of correctly counting the occurrences of a topic $t$ in all the top-$s$ sets approaches zero for both taxonomy sizes.
...and 1 more figure
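The channel constructions described in the captions of Figures 3 and 4 can be sketched in a few lines. The example below is a minimal illustration, assuming a toy taxonomy of $m=4$ topics, $s=2$, and $r=0.05$. It models the internal fixed-probability choice as the convex combination $(1-r)\,\mathsf{C}_{\mathit{BN}} + r\,\mathsf{C}_{\mathit{DP}}$ and cascading as matrix multiplication. The generalization channel `c_g` and all function names are hypothetical, and the user-to-history channel $\mathsf{C}_{\mathit{BH}}$ is omitted for brevity.

```python
from fractions import Fraction as F
from itertools import combinations

def bounded_noise(top_sets, m, s):
    """C_BN: report one of the s topics in the top-s set, each with probability 1/s."""
    return [[F(1, s) if t in top else F(0) for t in range(m)] for top in top_sets]

def dp_noise(top_sets, m):
    """C_DP: report a topic drawn uniformly from the whole taxonomy (1/m each)."""
    return [[F(1, m)] * m for _ in top_sets]

def internal_choice(c, d, r):
    """Fixed-probability internal choice: use c with probability 1-r, d with probability r."""
    return [[(1 - r) * a + r * b for a, b in zip(c_row, d_row)]
            for c_row, d_row in zip(c, d)]

def cascade(c, d):
    """Cascading of two channels is the product of their matrices."""
    return [[sum(c_row[k] * d[k][j] for k in range(len(d)))
             for j in range(len(d[0]))]
            for c_row in c]

m, s, r = 4, 2, F(1, 20)                      # toy taxonomy; r = 5% as in the caption
top_sets = list(combinations(range(m), s))    # the 6 possible top-2 sets
report = internal_choice(bounded_noise(top_sets, m, s),
                         dp_noise(top_sets, m), r)

# Hypothetical deterministic generalization channel C_G:
# three browsing histories, each mapped to one top-2 set.
history_to_set = [(0, 1), (0, 1), (2, 3)]
c_g = [[F(1) if top == target else F(0) for top in top_sets]
       for target in history_to_set]

# Browsing histories -> reported topic (C_G cascaded with the reporting channel).
c_g_report = cascade(c_g, report)
```

Each entry of `report` is $(1-r)/s + r/m$ when the topic belongs to the top-$s$ set and $r/m$ otherwise, so every row sums to 1. Histories that generalize to the same top-$s$ set yield identical rows in `c_g_report`, which is why the generalization step makes them indistinguishable to an observer of the reported topic.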

Theorems & Definitions (9)

Remark 1
Remark 2
Definition 3
Theorem 4
Corollary 5
Theorem 6
Theorem 7
Lemma 8
Theorem 9
