A pragmatic policy learning approach to account for users' fatigue in repeated auctions

Benjamin Heymann; Rémi Chan--Renous-Legoubin; Alexandre Gilotte

TL;DR

The paper tackles the problem that real-time bidding in repeated online auctions often optimizes only immediate payoff, neglecting long-term value reductions due to user fatigue. It introduces the cost of impatience, develops marginal analysis tools with inverse propensity score estimators, and proposes a fatigue-aware policy-learning approach that reallocates spend across user clusters to maximize value at a fixed budget. By combining offline counterfactual estimation with linearized IPS for variance control, it demonstrates offline improvements and confirms online gains (notably about a 0.7% value increase with roughly a 1% cost reduction). The work provides a practical, reinforcement-learning–inspired methodology for scalable, fatigue-aware bidding in RTB, with potential applicability to other sequential decision tasks.

Abstract

Online advertising banners are sold in real time through auctions. Typically, the more banners a user is shown, the smaller the marginal value of the next banner for this user. This fact can be detected by basic ML models, which can be used to predict how previously won auctions decrease the current opportunity value. However, learning is not enough to produce a bid that correctly accounts for how winning the current auction impacts future values. Indeed, a policy that uses this prediction to maximize the expected payoff of the current auction could be dubbed impatient, because such a policy does not fully account for the repeated nature of the auctions. From this perspective, it seems that most bidders in the literature are impatient. Unsurprisingly, impatience induces a cost. We provide two empirical arguments for the importance of this cost of impatience: first, an offline counterfactual analysis and, second, a notable improvement in business metrics obtained by mitigating the cost of impatience with policy learning.
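
To make the notion of impatience concrete, here is a minimal toy sketch; it is not taken from the paper. It assumes two consecutive second-price auctions for the same user, a competing price drawn uniformly on [0, 1], and a hypothetical fatigue factor that shrinks the value of the second banner when the first one is won. All names and numbers (`V1`, `V2`, `FATIGUE`, `auction_payoff`, `total_payoff`) are illustrative assumptions.

```python
def auction_payoff(bid, value):
    """Expected payoff of one second-price auction against a U[0, 1] competing price.

    The bidder wins with probability `bid` and pays the competing price,
    so the expected payoff is value * bid - bid**2 / 2 (for bid in [0, 1]).
    """
    bid = min(max(bid, 0.0), 1.0)
    return value * bid - bid ** 2 / 2.0


def total_payoff(first_bid, v1, v2, fatigue):
    """Expected payoff over both auctions, bidding truthfully in the last one."""
    win_prob = min(max(first_bid, 0.0), 1.0)
    tired_value = v2 * (1.0 - fatigue)           # value of banner 2 after a win
    payoff_if_won = auction_payoff(tired_value, tired_value)
    payoff_if_lost = auction_payoff(v2, v2)
    return (auction_payoff(first_bid, v1)
            + win_prob * payoff_if_won
            + (1.0 - win_prob) * payoff_if_lost)


# Hypothetical values, chosen only for illustration.
V1, V2, FATIGUE = 0.6, 0.6, 0.5

# Future payoff lost by winning the first auction.
future_loss = (auction_payoff(V2, V2)
               - auction_payoff(V2 * (1 - FATIGUE), V2 * (1 - FATIGUE)))

impatient_bid = V1                # maximizes the payoff of the current auction only
patient_bid = V1 - future_loss    # also accounts for the impact on the next auction

cost_of_impatience = (total_payoff(patient_bid, V1, V2, FATIGUE)
                      - total_payoff(impatient_bid, V1, V2, FATIGUE))

print(f"impatient bid: {impatient_bid:.3f}, patient bid: {patient_bid:.3f}")
print(f"cost of impatience: {cost_of_impatience:.6f}")
```

In this toy setting the patient bidder shades its first bid by the expected future loss, and the gap between the two total payoffs is one simple way to read "cost of impatience". The real problem has many auctions per user and unknown dynamics, which is why the paper turns to counterfactual estimation and policy learning rather than a closed form.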

Paper Structure (16 sections, 3 theorems, 8 equations, 4 figures)

Introduction
RTB auctions
A short story to illustrate the cost of impatience
Connection with the users' fatigue
Our contribution
Marginal analysis
IPS-based estimators
Linear approximation and marginal ROI
Maximising the value at constant cost
Experiments
Analytics
Proxy for the reward
Observed marginals
Pre-A/B test offline estimation of the resulting policy
Live experiments
...and 1 more sections

Key Result

Proposition 2.1.1

Let $S\subseteq [1..n]$ be independent of $\Theta$ and let $\alpha > 0$; then [...] is an unbiased estimator of $M(S,\alpha)$.

Figures (4)

Figure 1: Empirical Click-Through Rates (CTR) and predicted CTR computed on the Criteo Attribution Modeling for Bidding Dataset [diemert2017pcb]. The green predictor uses a fatigue variable (the time since the last display), while the black one does not. We observe that the predictor without the fatigue variable tends to overpredict on recently exposed users.
Figure 2: Standard deviation of the exact importance weights appearing in Proposition 2.1.1 (counterfactual estimator) as a function of the multiplicative factor $\alpha$, computed empirically from lognormal samples. It grows exponentially with $\alpha-1$. By contrast, the weights of the linearised estimator (Proposition 2.2.1, marginal counterfactual estimator) are directly proportional to $\alpha-1$, so their standard deviation grows only linearly.
Figure 3: The two plots (02/2022 and 04/2022) show the confidence intervals of the marginal ROI computed for different levels of ad exposure. Each plot uses sampled data from one month. We processed only the data for which we have access to the ad exposure. Bucket 5 corresponds to extreme values for which we have fewer samples, resulting in larger confidence intervals. Intuitively, because winning an auction decreases the value of future auctions, the bidder should decrease its bid when it expects to receive many bidding opportunities.
Figure 4: Pre-A/B test offline estimation of the resulting policy. $\Delta$ on the x-axis is the maximum amplitude of the parameter change. The green vertical line corresponds to the amplitude of change tested during the online A/B test. We see that the linearization drastically reduces the confidence intervals (dotted lines).
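
The variance contrast described in the Figure 2 caption can be reproduced with a small sketch. It assumes, as a simplification, that the logged bid is a base bid multiplied by lognormal noise with log-scale `SIGMA`, and that the counterfactual policy multiplies the bid by $\alpha$. The importance weight is then a ratio of two lognormal densities, and its first-order expansion around $\alpha = 1$ gives a linearised weight. `SIGMA`, `exact_weights`, and `linearised_weights` are hypothetical names, and the exact estimators in the paper may differ from this model.

```python
import numpy as np

SIGMA = 0.5       # hypothetical log-scale of the bid randomisation
N_SAMPLES = 200_000

rng = np.random.default_rng(0)
# x = log(bid) - log(base bid) under the logging policy, x ~ N(0, SIGMA^2)
log_noise = rng.normal(0.0, SIGMA, size=N_SAMPLES)


def exact_weights(alpha, x, sigma=SIGMA):
    """Density ratio N(log(alpha), sigma^2) / N(0, sigma^2) evaluated at x."""
    shift = np.log(alpha)
    return np.exp(x * shift / sigma ** 2 - shift ** 2 / (2.0 * sigma ** 2))


def linearised_weights(alpha, x, sigma=SIGMA):
    """First-order expansion of the exact weights around alpha = 1."""
    return 1.0 + (alpha - 1.0) * x / sigma ** 2


def exact_std_closed_form(alpha, sigma=SIGMA):
    """Standard deviation of the exact weights under this lognormal model."""
    return np.sqrt(np.exp(np.log(alpha) ** 2 / sigma ** 2) - 1.0)


def linearised_std_closed_form(alpha, sigma=SIGMA):
    """Standard deviation of the linearised weights: |alpha - 1| / sigma."""
    return abs(alpha - 1.0) / sigma


for alpha in (1.1, 1.5, 2.0, 3.0):
    empirical_exact = exact_weights(alpha, log_noise).std()
    empirical_linear = linearised_weights(alpha, log_noise).std()
    print(f"alpha={alpha:.1f}  "
          f"exact std: {empirical_exact:.3f} (theory {exact_std_closed_form(alpha):.3f})  "
          f"linearised std: {empirical_linear:.3f} "
          f"(theory {linearised_std_closed_form(alpha):.3f})")
```

Under these assumptions the linearised weights have a standard deviation of exactly $|\alpha-1|/\sigma$, while the exact weights are heavy-tailed and their empirical standard deviation becomes both large and unstable as $\alpha$ moves away from 1. The linearised estimator is only a local approximation, so it trades this variance reduction for bias at large $\alpha - 1$.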

Theorems & Definitions (3)

Proposition 2.1.1: Counterfactual estimator
Proposition 2.2.1: Marginal counterfactual estimator
Proposition 2.2.2: marginal ROI
