Statistical Learning from Attribution Sets
Lorne Applebaum, Robert Busa-Fekete, August Y. Chen, Claudio Gentile, Tomer Koren, Aryan Mokhtari
TL;DR
The paper addresses learning conversion-rate (CVR) models when explicit click-to-conversion links are unavailable due to privacy constraints, and formulates the problem as learning from attribution sets generated by an oblivious adversary with a known prior. It derives an unbiased estimator of the population loss by decomposing the loss into a base term and a label-dependent term, then mapping the label signal to observable signals through the attribution sets; this enables ERM with generalization guarantees. Theoretical results show that sample complexity scales with the informativeness of the prior, measured by $\Sigma=\|\pi\|_2^2$, and that ERM is robust to errors in an estimated prior $\widehat{\pi}$, which enter through a bias term that depends on $\|\pi-\widehat{\pi}\|$. Empirical results on MNIST, CIFAR-10, and Higgs show substantial improvements over industry baselines, particularly when attribution sets are large or overlapping, supporting the practical potential of privacy-preserving attribution learning.
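To make the loss decomposition concrete, here is a minimal sketch for binary conversion labels $y \in \{0,1\}$. The per-example loss $\ell$ and the model $f$ are notation introduced here for illustration only; the paper's exact form may differ.

$$
\ell\big(f(x), y\big) \;=\; \underbrace{\ell\big(f(x), 0\big)}_{\text{base term}} \;+\; y\,\underbrace{\Big[\ell\big(f(x), 1\big) - \ell\big(f(x), 0\big)\Big]}_{\text{label-dependent term}}
$$

The identity holds for any loss because $y$ takes only the values 0 and 1. The base term needs no label and can be evaluated on every click. The loss is linear in $y$, so, under this sketch, replacing $y$ with a surrogate built from the attribution sets and the known prior would leave the expected loss unchanged, provided the surrogate is unbiased for $y$.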
Abstract
We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.
