Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

TL;DR

This work tackles auto-bidding under a budget by augmenting a generative auto-bidding framework with a trajectory evaluator and a KL-Lipschitz-constrained score-maximization objective, enabling safe exploration beyond the offline dataset. The core idea, AIGB-Pearl, learns a supervised trajectory evaluator and couples it with a Lipschitz-constrained planner trained via a Wasserstein-based regularization using synchronous coupling; a sub-optimality bound ties evaluator bias and dataset mismatch to the overall performance. The approach avoids bootstrapping instability typical of offline RL and provides theoretical guarantees on generalization while maintaining training stability. Extensive simulations and real-world A/B tests demonstrate state-of-the-art GMV gains and improved generalization to unseen advertisers, validating both the theoretical framework and practical algorithm. The method holds practical significance for scalable, reliable, and safe generative planning in large-scale online advertising systems.

Abstract

Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
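To make the synchronous coupling idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the planner can be written as a function of a condition and a noise sample; `toy_planner`, `weight`, and `synchronous_coupling_w1_upper_bound` are hypothetical names introduced only for illustration. Feeding the same noise to the planner under two conditions yields a valid coupling of the two output distributions, so the average Frobenius distance between the paired trajectories upper-bounds their Wasserstein-1 distance.

```python
import numpy as np


def toy_planner(condition, noise, weight):
    """Hypothetical stand-in for a generative planner (an assumption, not the paper's model).

    Maps a scalar condition and a noise array to a T x D trajectory.
    """
    return np.tanh(condition * weight + noise)


def synchronous_coupling_w1_upper_bound(planner, cond_a, cond_b, weight, shape,
                                        n_samples=256, seed=0):
    """Monte Carlo upper bound on the Wasserstein-1 distance between the
    trajectory distributions generated under cond_a and cond_b.

    Synchronous coupling: each pair of trajectories shares one noise sample.
    """
    rng = np.random.default_rng(seed)
    distances = []
    for _ in range(n_samples):
        noise = rng.standard_normal(shape)      # one noise draw, used for both conditions
        traj_a = planner(cond_a, noise, weight)
        traj_b = planner(cond_b, noise, weight)
        # For 2-D arrays, np.linalg.norm defaults to the Frobenius norm.
        distances.append(np.linalg.norm(traj_a - traj_b))
    return float(np.mean(distances))


if __name__ == "__main__":
    weight = np.ones((4, 2))                    # illustrative T=4, D=2
    bound = synchronous_coupling_w1_upper_bound(toy_planner, 0.5, 0.8, weight, weight.shape)
    print(f"W1 upper bound between the two conditions: {bound:.4f}")
```

Dividing such an estimate by the distance between the two conditions gives an empirical Lipschitz ratio that a penalty term could act on; how the paper forms its penalty is not shown in this summary.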

Paper Structure

This paper contains 44 sections, 10 theorems, 53 equations, 11 figures, 9 tables, 1 algorithm.

Key Result

Theorem 1

The trajectory quality $y(\tau)$ is $\sqrt{T}R_m$-Lipschitz continuous with respect to the Frobenius norm.
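
As an illustration of how a $\sqrt{T}$ factor of this kind can arise, here is a short sketch under stated assumptions; it is not the paper's proof, and the definition of $y(\tau)$ is not reproduced in this summary. Assume, hypothetically, that $\tau$ is a matrix whose rows $s_1,\dots,s_T$ are the per-step states, that $y(\tau)=\sum_{t=1}^{T} r(s_t)$, and that each per-step term $r$ is $R_m$-Lipschitz in the Euclidean norm. Then the triangle inequality followed by Cauchy-Schwarz gives:

$$
\begin{aligned}
|y(\tau)-y(\tau')| &\le \sum_{t=1}^{T} |r(s_t)-r(s'_t)| \\
&\le R_m \sum_{t=1}^{T} \|s_t-s'_t\|_2 \\
&\le R_m \sqrt{T}\,\Big(\sum_{t=1}^{T} \|s_t-s'_t\|_2^2\Big)^{1/2} \\
&= \sqrt{T}\,R_m\,\|\tau-\tau'\|_F .
\end{aligned}
$$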

Figures (11)

  • Figure 1: AIGB-Pearl (Planning with EvaluAtor via RL) constructs a trajectory evaluator to score trajectory quality and lets the planner maximize the obtained score under the KL-Lipschitz constraint through continuous interaction with the evaluator. A synchronous coupling method is used to estimate the Wasserstein term in the Lipschitz penalty.
  • Figure 2: Trajectory Generation Visualization. Three cases are presented. AIGB-Pearl generates plausible trajectories, whereas its variant without the KL-Lipschitz constraint produces trajectories that deviate significantly from the reference and exhibit evident issues.
  • Figure 3: Examination of Evaluator Lipschitz.
  • Figure 4: Examination of Planner Lipschitz.
  • Figure 5: The impression opportunities between time steps $t$ and $t+1$, where $p_t^i/v_t^i$ is the $1/\text{ROI}$ of impression $i$. Without loss of generality, consider two actions $a_{1,t}$ and $a_{2,t}$, and let $a_{2,t}\ge a_{1,t}$. The impressions within the shaded area are those won by action $a_{2,t}$ but lost by action $a_{1,t}$.
  • ...and 6 more figures

Theorems & Definitions (17)

  • Definition 1: Trajectory and Trajectory Quality
  • Theorem 1: Lipschitz Continuity of $y(\tau)$.
  • Theorem 2: Evaluator Bias in Planning Performance Bound
  • Remark 1
  • Theorem 3: Sub-optimality Gap Bound
  • Remark 2
  • Theorem 1: Lipschitz Continuity of $y(\tau)$.
  • proof
  • Lemma 1: Additivity of the Lipschitz
  • proof
  • ...and 7 more