CriteoPrivateAds: A Real-World Bidding Dataset to Design Private Advertising Systems

Mehdi Sebbar, Corentin Odic, Mathieu Léchine, Aloïs Bissuel, Nicolas Chrysanthos, Anthony D'Amato, Alexandre Gilotte, Fabian Höring, Sarah Nogueira, Maxime Vono

TL;DR

The paper tackles the lack of public ground truth for evaluating private advertising systems built on differential privacy (DP) and related browser-vendor APIs. It introduces CriteoPrivateAds, a large real-world anonymised bidding dataset with $100$M displays across $30$ days, hashed IDs, DP-friendly signals, and labels such as $is\_clicked$, $is\_click\_landed$, and $nb\_sales$, enabling offline benchmarking of DP and aggregation-based approaches. The authors provide baseline results using a re-weighted loss $\tilde{\ell}(y,\hat{y})$ and a calibrated metric $LLH_{CompVN}(f)$. They also outline privacy-preserving paradigms, including DP-SGD, learning from label proportions, and two-tower Protected Audience inference, for evaluating privacy-utility trade-offs under Chrome Privacy Sandbox constraints. The dataset and accompanying baselines are intended to foster rigorous experimentation in private bidding, offering production-relevant realism while protecting user data and supporting a viable open internet ecosystem.

Abstract

In recent years, many proposals have emerged to address online advertising use-cases without access to third-party cookies. All these proposals leverage privacy-enhancing technologies such as aggregation or differential privacy. Yet no public and rich-enough ground truth is currently available to assess the relevance of these private advertising frameworks. We are releasing the largest bidding dataset, in terms of number of features, specifically built in alignment with the design of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, coined CriteoPrivateAds, is an anonymised version of Criteo production logs and provides sufficient data to learn bidding models commonly used in online advertising under many privacy constraints (delayed reports, display and user-level differential privacy, user signal quantisation or aggregated reports). We ensured that this dataset, while being anonymised, is able to provide offline results close to the production performance of adtech companies including Criteo, making it a relevant ground truth to design private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd.
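To make the "aggregated reports" and "display-level differential privacy" constraints concrete, here is a minimal sketch of a noisy aggregate, not the paper's own pipeline. It assumes each display is reduced to a (bucket, `is_clicked`) pair, that the list of buckets is fixed and public, and that neighbouring datasets differ by one display, so each count has sensitivity 1. The bucket names, function names, and the choice of the Laplace mechanism are illustrative assumptions.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential samples with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_click_counts(rows, buckets, epsilon, rng):
    """Return per-bucket click counts with Laplace noise added.

    rows    : iterable of (bucket, is_clicked) pairs, one per display
    buckets : fixed, public list of bucket keys (hypothetical names)
    epsilon : privacy budget; assumption: one display changes a single
              count by at most 1, so the noise scale is 1 / epsilon
    """
    counts = Counter()
    for bucket, is_clicked in rows:
        counts[bucket] += int(is_clicked)
    scale = 1.0 / epsilon
    # Iterate over the public bucket list so the set of released keys
    # does not depend on the data; absent buckets count as zero.
    return {b: counts.get(b, 0) + laplace_noise(scale, rng) for b in buckets}


if __name__ == "__main__":
    rng = random.Random(0)
    rows = [("campaign_a", 1), ("campaign_a", 0), ("campaign_b", 1)]
    print(noisy_click_counts(rows, ["campaign_a", "campaign_b", "campaign_c"], 1.0, rng))
```

A smaller `epsilon` gives stronger privacy and noisier counts. User-level privacy would additionally require bounding how many displays a single user can contribute before the counts are computed, which this sketch does not do.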

Paper Structure

This paper contains 13 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Discrepancy between empirical distributions of users' contributions in CriteoPrivateAds and associated online traffic.
  • Figure 2: Two-tower inference architecture constraining how private bidding models are built.

Theorems & Definitions (6)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6