Session-Level Dynamic Ad Load Optimization using Offline Robust Reinforcement Learning

Tao Liu, Qi Xu, Wei Shi, Zhigang Hua, Shuang Yang

TL;DR

We develop an offline deep Q-network (DQN)-based framework that effectively mitigates confounding bias in dynamic systems and achieves more than 80% offline gains over the best causal learning-based production baseline.

Abstract

Session-level dynamic ad load optimization aims to personalize the density and types of delivered advertisements in real time during a user's online session by dynamically balancing user experience quality and ad monetization. Traditional causal learning-based approaches struggle with key technical challenges, especially in handling confounding bias and distribution shifts. In this paper, we develop an offline deep Q-network (DQN)-based framework that effectively mitigates confounding bias in dynamic systems and achieves more than 80% offline gains over the best causal learning-based production baseline. Moreover, to improve the framework's robustness against unanticipated distribution shifts, we further enhance it with a novel offline robust dueling DQN approach. This approach achieves more stable rewards on multiple OpenAI-Gym datasets as perturbations increase, and provides an additional 5% offline gain on real-world ad delivery data. Deployed across multiple production systems, our approach has achieved outsized topline gains. Post-launch online A/B tests have shown double-digit improvements in the engagement-ad score trade-off efficiency, significantly enhancing our platform's capability to serve both consumers and advertisers.
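The abstract builds on two standard ingredients, an offline DQN and a dueling DQN. As a hedged reference point, not the authors' implementation, the NumPy sketch below shows the usual dueling aggregation of a state value and per-action advantages, plus a one-step Q-learning target and TD loss computed purely from logged transitions (no environment interaction), which is what "offline" means in this setting. The function names `dueling_q`, `offline_td_targets`, and `td_loss`, the toy batch, and the three "ad-load actions" are hypothetical; the paper's network, confounding-bias handling, and robust Bellman operator are not modeled.

```python
import numpy as np

def dueling_q(value, advantage):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').

    value: shape [batch]; advantage: shape [batch, n_actions].
    Subtracting the mean advantage makes the V/A split identifiable.
    """
    return value[:, None] + advantage - advantage.mean(axis=1, keepdims=True)

def offline_td_targets(rewards, next_q, dones, gamma=0.99):
    """One-step Q-learning targets r + gamma * max_a' Q(s', a') from a fixed logged batch.

    Terminal transitions (dones == 1) are not bootstrapped.
    """
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def td_loss(q_values, actions, targets):
    """Mean squared TD error on the actions that were actually logged."""
    chosen = q_values[np.arange(len(actions)), actions]
    return float(np.mean((chosen - targets) ** 2))

# Toy logged batch (hypothetical): two transitions, three candidate ad-load actions.
q = dueling_q(np.array([1.0, 0.0]), np.array([[1.0, -1.0, 0.0], [2.0, 2.0, 2.0]]))
# q -> [[2, 0, 1], [0, 0, 0]]
# q is reused as the next-state values only to keep the toy small; a real DQN
# would evaluate a target network on s'.
targets = offline_td_targets(np.array([1.0, 0.5]), q, np.array([0.0, 1.0]), gamma=0.9)
# targets -> [2.8, 0.5]
loss = td_loss(q, np.array([2, 0]), targets)
```

Mean subtraction is the common identifiability choice for the dueling head; a max-advantage variant also exists.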

Paper Structure

This paper contains 24 sections, 3 theorems, 13 equations, 6 figures, 4 tables.

Key Result

Proposition 1

For the IPM uncertainty set $\mathcal{P}_{s,a}$ with function class $\mathcal{F}$ defined in Equation (eqn:func-class), we have $\inf_{P \in \mathcal{P}_{s,a}} P^\top V_w = (P^0_{s,a})^\top V_w - \delta \|w_{2:d}\|$.
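The summary does not restate the function class, so the sketch below adopts assumptions common in robust RL with linear function approximation: $V_w(s) = \phi(s)^\top w$ with a constant first feature $\phi_1(s) = 1$, $\mathcal{F} = \{\phi(\cdot)^\top \theta : \|\theta_{2:d}\|_2 \le 1\}$, and $\mathcal{P}_{s,a} = \{P : d_{\mathcal{F}}(P, P^0_{s,a}) \le \delta\}$ with $d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} (P^\top f - Q^\top f)$. Under these assumptions the constant feature cancels because both distributions have unit mass, so $d_{\mathcal{F}}(P, P^0) = \|\Phi_{2:d}^\top (P - P^0)\|_2$, and Cauchy-Schwarz gives the lower bound. The NumPy check confirms numerically that no perturbation inside the ball undercuts the closed form and that one perturbation attains it. That attaining $P$ may have negative entries, so treating $P$ as a unit-mass signed measure is also an assumption. The helper names `robust_expectation` and `ipm_distance` and all numbers are hypothetical.

```python
import numpy as np

def robust_expectation(p0, phi, w, delta):
    """Closed form of Proposition 1: (P^0)^T V_w - delta * ||w_{2:d}||_2.

    Assumes V_w = phi @ w, phi[:, 0] == 1 (constant first feature), Euclidean norm.
    """
    return float(p0 @ (phi @ w) - delta * np.linalg.norm(w[1:]))

def ipm_distance(p, p0, phi):
    """d_F(P, P0) = sup over ||theta_{2:d}|| <= 1 of (P - P0)^T phi theta.

    Valid when P and P0 have the same total mass, so the constant feature cancels.
    """
    return float(np.linalg.norm(phi[:, 1:].T @ (p - p0)))

rng = np.random.default_rng(0)
n_states, d, delta = 6, 3, 0.1
phi = np.column_stack([np.ones(n_states), rng.normal(size=(n_states, d - 1))])
w = np.array([0.5, 1.0, -2.0])
p0 = np.full(n_states, 1.0 / n_states)  # hypothetical nominal next-state distribution
bound = robust_expectation(p0, phi, w, delta)

# 1) No zero-mass perturbation inside the IPM ball goes below the closed form.
for _ in range(1000):
    step = rng.normal(size=n_states)
    step -= step.mean()  # total mass stays 1
    step *= delta * rng.uniform() / ipm_distance(p0 + step, p0, phi)  # distance <= delta
    assert (p0 + step) @ (phi @ w) >= bound - 1e-9

# 2) The bound is attained by shifting phi_{2:d}^T P by -delta * w_{2:d} / ||w_{2:d}||.
u = w[1:] / np.linalg.norm(w[1:])
rhs = np.concatenate([[0.0], -delta * u])  # first entry 0: no change in total mass
step, *_ = np.linalg.lstsq(phi.T, rhs, rcond=None)
assert np.isclose(ipm_distance(p0 + step, p0, phi), delta)
assert np.isclose((p0 + step) @ (phi @ w), bound)
```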

Figures (6)

  • Figure 1: Structure of the session-level dynamic ad load optimization system. A novel offline robust reinforcement learning approach is applied in the prediction module, which generates state-action values as inputs to the decision module (see Figure 2 for the detailed structure).
  • Figure 2: Structure of robust dueling DQN. Robustness is incorporated through the empirical robust Bellman operator (Equations (eqn:emp-bell) and (eqn:general-emp-bell)).
  • Figure 3: Data analysis on confounding bias (X|T-shifts) and time-wise user behavior shift (Y|X-shifts)
  • Figure 4: Cumulative rewards of robust dueling DQN and dueling DQN under perturbation
  • Figure 5: Test AUCC of offline DQN and T-learner on session-level production data.
  • ...and 1 more figure
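Figure 5 compares offline DQN against a T-learner baseline. For reference, the sketch below shows the generic T-learner meta-learner: fit one outcome model on treated units and one on control units, then take the difference of their predictions as the conditional effect estimate. It uses plain linear regressions on synthetic randomized data; the function names, linear models, and data-generating process are hypothetical, since the production baseline's models and features are not described in this summary.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, x):
    """Predictions of a model fitted by fit_linear."""
    return np.column_stack([np.ones(len(x)), x]) @ coef

def t_learner_cate(x, treatment, y, x_new):
    """T-learner: separate outcome models mu_1 (T=1) and mu_0 (T=0); CATE = mu_1 - mu_0."""
    coef1 = fit_linear(x[treatment == 1], y[treatment == 1])
    coef0 = fit_linear(x[treatment == 0], y[treatment == 0])
    return predict_linear(coef1, x_new) - predict_linear(coef0, x_new)

# Synthetic randomized data with a known effect tau(x) = 1 + 2 * x[:, 0] (hypothetical).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))
treatment = rng.integers(0, 2, size=n)
y = x @ np.array([0.5, -1.0]) + treatment * (1.0 + 2.0 * x[:, 0]) + 0.1 * rng.normal(size=n)
x_new = np.array([[0.0, 0.0], [1.0, 0.0]])
est = t_learner_cate(x, treatment, y, x_new)  # approximately [1.0, 3.0]
```

The estimate is only trustworthy when treatment assignment is unconfounded given `x`, which holds here because the synthetic treatment is randomized; in logged production data that assumption is exactly where confounding bias can enter.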

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Theorem 1