Improving conversion rate prediction via self-supervised pre-training in online advertising

Alex Shtoff, Yohay Kaplan, Ariel Raviv

TL;DR

To address data sparsity in CVR prediction under latency constraints, the authors introduce a self-supervised pre-training framework: an auto-encoder trained on all conversion events produces a low-dimensional code $C(\Omega)$, which augments an incremental Offset FM-based CVR model trained only on click-attributed data. The code enters the scoring function through a linear term $\langle W, C(\Omega)\rangle$, which supports continual learning without introducing calibration bias, and random Fourier features address nonlinear separability with minimal latency impact. The approach improves logloss and AUC offline, shows online gains in CPM and revenue, and has been safely deployed to all Yahoo Gemini native traffic. The work shows practical benefits for calibration-sensitive advertiser goals and demonstrates a scalable path for self-supervision in latency-constrained ad auctions.
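
The scoring function described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a plain second-order FM rather than the Offset FM variant, and it assumes the random Fourier features are applied to the code $C(\Omega)$ before the linear term, which the summary does not spell out. All function names, shapes, and numeric values are hypothetical.

```python
import math
import random


def fm_score(bias, linear, factors, active):
    """Plain second-order FM score over the active (one-hot) feature ids."""
    score = bias + sum(linear[i] for i in active)
    k = len(factors[active[0]])
    for f in range(k):
        s = sum(factors[i][f] for i in active)
        s2 = sum(factors[i][f] ** 2 for i in active)
        # 0.5 * ((sum v)^2 - sum v^2) equals the sum of pairwise products
        score += 0.5 * (s * s - s2)
    return score


def random_fourier_features(code, omegas, phases):
    """Map a code vector to sqrt(2 / D) * cos(<omega_j, code> + b_j)."""
    scale = math.sqrt(2.0 / len(omegas))
    return [
        scale * math.cos(sum(o * c for o, c in zip(omega, code)) + b)
        for omega, b in zip(omegas, phases)
    ]


def cvr_probability(fm_part, w, code_features):
    """Sigmoid of the FM score plus the linear term <W, code features>."""
    logit = fm_part + sum(wi * ci for wi, ci in zip(w, code_features))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical usage: two active features and a 3-dimensional code.
rng = random.Random(0)
code = [0.3, -1.2, 0.7]  # stand-in for the encoder output C(Omega)
num_features = 8
omegas = [[rng.gauss(0.0, 1.0) for _ in code] for _ in range(num_features)]
phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
features = random_fourier_features(code, omegas, phases)

base = fm_score(
    bias=0.1,
    linear={0: 0.2, 1: -0.1},
    factors={0: [1.0, 0.0], 1: [0.5, 2.0]},
    active=[0, 1],
)
w = [0.1] * num_features  # stand-in for the learned weights W
print(cvr_probability(base, w, features))
```

Because the code only adds a dot product to the logit, the extra cost at serving time is linear in the number of code features, which fits the latency constraint the summary emphasizes.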

Abstract

The task of predicting conversion rates (CVR) lies at the heart of online advertising systems aiming to optimize bids to meet advertiser performance requirements. Even with the recent rise of deep neural networks, these predictions are often made by factorization machines (FM), especially in commercial settings where inference latency is key. These models are trained using the logistic regression framework on labeled tabular data formed from past user activity that is relevant to the task at hand. Many advertisers only care about click-attributed conversions. A major challenge in training models that predict conversions-given-clicks comes from data sparsity: clicks are rare, and conversions attributed to clicks are even rarer. However, mitigating sparsity by adding conversions that are not click-attributed to the training set impairs model calibration. Since calibration is critical to achieving advertiser goals, this is infeasible. In this work we use the well-known idea of self-supervised pre-training, and use an auxiliary auto-encoder model trained on all conversion events, both click-attributed and not, as a feature extractor to enrich the main CVR prediction model. Since the main model does not train on non-click-attributed conversions, this does not impair calibration. We adapt the basic self-supervised pre-training idea to our online advertising setup by using a loss function designed for tabular data, facilitating continual learning by ensuring auto-encoder stability, and incorporating a neural network into a large-scale real-time ad auction that ranks tens of thousands of ads, under strict latency constraints, and without incurring a major engineering cost. We show improvements both offline, during training, and in an online A/B test. Following its success in A/B tests, our solution is now fully deployed to the Yahoo native advertising system.
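
The calibration argument in the abstract can be illustrated with a small worked example. The counts below are made up for illustration and do not come from the paper. The sketch relies on one assumption: a well-fit logistic model's average prediction tracks the positive rate of its training set, so inflating that rate with conversions that are not click-attributed inflates the predicted conversions-given-clicks.

```python
# Hypothetical counts, chosen only to make the arithmetic easy to follow.
clicks = 1000             # training rows that are clicks
click_conversions = 20    # positives attributed to those clicks
extra_conversions = 30    # conversions that are not click-attributed

# Target quantity: conversions given clicks.
true_cvr = click_conversions / clicks

# Positive rate seen by a model if the extra conversions are added as positives.
mixed_rate = (click_conversions + extra_conversions) / (clicks + extra_conversions)

# Ratio of average prediction to the true rate; 1.0 means calibrated.
calibration_ratio = mixed_rate / true_cvr

print(f"true CVR:          {true_cvr:.4f}")
print(f"mixed-set rate:    {mixed_rate:.4f}")
print(f"calibration ratio: {calibration_ratio:.2f}")
```

With these numbers the mixed training set would push the model to over-predict by more than a factor of two. Using the extra conversions only to train the auxiliary auto-encoder keeps the main model's labels restricted to click-attributed events.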

Paper Structure (21 sections, 11 equations, 7 figures)


Figures (7)

  • Figure 1: A native ad on the Yahoo homepage that resembles the surrounding content.
  • Figure 2: Schematic illustration of our self-supervised pre-training framework. Left: the auto-encoder, trained on all conversions, both click-attributed and not. Right: the CVR prediction model, trained on clicks and click-attributed conversions, which uses the encoder's code for its training data as an additional feature.
  • Figure 3: Auto-encoder model architecture. For clarity, illustrated with $C = 3$ tabular columns. The input is used to choose embeddings, which are concatenated and fed to a multi-layer perceptron to produce the code. The decoder, in turn, is a multi-layer perceptron that acts as a set of classifiers with cross-entropy losses, which are averaged to obtain the output.
  • Figure 4: Reconstruction losses on real and random data for an auto-encoder trained on a month's worth of data.
  • Figure 5: Daily generalization metric $\mathrm{Gen}_t$ (closer to 1 is better) versus daily stability metric $\mathrm{Diff}_t$ (lower is better).
  • ...and 2 more figures
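
The decoder described in the Figure 3 caption, a set of per-column classifiers whose cross-entropy losses are averaged, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact loss: the function names are hypothetical, each tabular column is assumed to be categorical with its own logit vector, and the average is taken uniformly over columns.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one categorical column, computed stably from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def reconstruction_loss(column_logits, row):
    """Average the per-column cross-entropy losses for one tabular row.

    column_logits: one logit vector per column, as produced by the decoder.
    row: the true category index of each column.
    """
    losses = [
        softmax_cross_entropy(logits, target)
        for logits, target in zip(column_logits, row)
    ]
    return sum(losses) / len(losses)


# Hypothetical row with C = 3 columns of cardinality 2, 4, and 3.
decoder_output = [
    [2.0, -1.0],
    [0.1, 0.3, 1.5, -0.2],
    [0.0, 0.0, 0.0],
]
true_row = [0, 2, 1]
print(reconstruction_loss(decoder_output, true_row))
```

Treating every column as its own classification problem suits tabular data, where columns are categorical with different cardinalities and a squared-error reconstruction would not be meaningful.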