CriteoPrivateAds: A Real-World Bidding Dataset to Design Private Advertising Systems
Mehdi Sebbar, Corentin Odic, Mathieu Léchine, Aloïs Bissuel, Nicolas Chrysanthos, Anthony D'Amato, Alexandre Gilotte, Fabian Höring, Sarah Nogueira, Maxime Vono
TL;DR
The paper tackles the lack of public ground truth for evaluating private advertising systems built on differential privacy (DP) and related browser-vendor APIs. It introduces CriteoPrivateAds, a large, real-world, anonymised bidding dataset with $100$M displays across $30$ days, hashed IDs, DP-friendly signals, and labels such as $is\_clicked$, $is\_click\_landed$, and $nb\_sales$, enabling offline benchmarking of DP and aggregation-based approaches. The authors provide baseline results using a re-weighted loss $\tilde{\ell}(y,\hat{y})$ and a calibrated metric $LLH_{CompVN}(f)$. They also outline privacy-preserving paradigms, including DP-SGD, learning from label proportions, and two-tower Protected Audience inference, for evaluating privacy-utility trade-offs under Chrome Privacy Sandbox constraints. The dataset and accompanying baselines are intended to foster rigorous experimentation and innovation in private bidding, offering production-relevant realism while protecting user data and supporting a viable open internet ecosystem.
Abstract
In recent years, many proposals have emerged to address online advertising use-cases without access to third-party cookies. All these proposals leverage privacy-enhancing technologies such as aggregation or differential privacy. Yet, no public and rich-enough ground truth is currently available to assess the relevance of these private advertising frameworks. We are releasing the largest bidding dataset, in terms of number of features, specifically built in alignment with the design of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, coined CriteoPrivateAds, is an anonymised version of Criteo production logs and provides sufficient data to learn bidding models commonly used in online advertising under many privacy constraints (delayed reports, display and user-level differential privacy, user signal quantisation or aggregated reports). We ensured that this dataset, while being anonymised, is able to provide offline results close to the production performance of adtech companies including Criteo, making it a relevant ground truth to design private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd.
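To make one of the privacy constraints listed above concrete, the following is a minimal sketch of an aggregated report released under display-level differential privacy, using the standard Laplace mechanism. It is not the authors' pipeline: the function name `noisy_click_counts`, the bucket layout and the synthetic data are all hypothetical, and the sketch assumes add/remove-one-display neighbouring datasets, where each display falls in exactly one bucket and contributes at most one click.

```python
import numpy as np

def noisy_click_counts(bucket_ids, is_clicked, n_buckets, epsilon, rng):
    """Release per-bucket click counts with the Laplace mechanism.

    Assumption: adding or removing one display changes a single bucket
    count by at most 1, so the L1 sensitivity of the count vector is 1.
    """
    # Exact (non-private) number of clicks in each bucket.
    counts = np.bincount(
        bucket_ids, weights=is_clicked.astype(float), minlength=n_buckets
    )
    # Laplace noise with scale = sensitivity / epsilon = 1 / epsilon.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=n_buckets)
    return counts + noise

# Synthetic stand-in data (hypothetical, not drawn from the dataset).
rng = np.random.default_rng(0)
bucket_ids = rng.integers(0, 4, size=10_000)  # a quantised user signal
is_clicked = rng.random(10_000) < 0.05        # synthetic click labels

released = noisy_click_counts(
    bucket_ids, is_clicked, n_buckets=4, epsilon=1.0, rng=rng
)
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is the kind of privacy-utility trade-off the dataset is meant to help measure. User-level privacy would additionally require bounding how many displays a single user can contribute.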
