LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan, Bo An, Peng Jiang

TL;DR

This paper proposes a hierarchical Large autoBidding Model (LBM) that uses the reasoning capabilities of LLMs to develop a stronger auto-bidding strategy. It also proposes GQPO, an offline reinforcement fine-tuning technique that mitigates the LBM-Think's hallucinations and improves decision-making performance without the simulation or real-world rollouts that previous multi-turn LLM-based methods require.

Abstract

The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can behave counterintuitively because of black-box training and the limited mode coverage of datasets. As a result, they struggle to understand task status and to generalize in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding is difficult: competitive auctions require precise actions, and LLMs lack specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) that leverages the reasoning capabilities of LLMs to develop a superior auto-bidding strategy. It consists of a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism that efficiently fuses two modalities, language and numerical inputs, for language-guided training of the LBM-Act. We then propose an offline reinforcement fine-tuning technique, termed GQPO, that mitigates the LBM-Think's hallucinations and enhances decision-making performance without the simulation or real-world rollouts that previous multi-turn LLM-based methods require. Experiments demonstrate that a generative backbone built on our LBM is superior, especially in training efficiency and generalization ability.
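
To make the two-level split concrete, the following is a minimal sketch of how a high-level reasoning step could feed a low-level action step. It is not the authors' implementation: `BidState`, `run_episode`, `toy_think`, and `toy_act` are hypothetical names, the state fields are assumptions, and the toy rules merely stand in for trained LBM-Think and LBM-Act models. The one-step offset between thinking and acting follows the inference procedure described in the Figure 4 caption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BidState:
    """Hypothetical numerical state; the field names are illustrative only."""
    budget_left: float  # fraction of budget remaining, in [0, 1]
    cpa_ratio: float    # assumed: realized CPA divided by the CPA constraint


def run_episode(
    states: List[BidState],
    think: Callable[[BidState], str],
    act: Callable[[str, BidState], float],
    a0: float = 1.0,
) -> List[float]:
    """Think on the state at t-1, act on the state at t; return the parameter path."""
    a = a0
    path = [a]
    for prev, cur in zip(states, states[1:]):
        guidance = think(prev)   # high-level reasoning text (stand-in for a CoT)
        a += act(guidance, cur)  # low-level adjustment of the bidding parameter
        path.append(a)
    return path


# Toy stand-ins so the sketch runs; a real system would call trained models here.
def toy_think(s: BidState) -> str:
    return "decrease" if s.cpa_ratio > 1.0 else "increase"


def toy_act(guidance: str, s: BidState) -> float:
    step = 0.1 * s.budget_left
    return -step if guidance == "decrease" else step
```

Passing the reasoning output as plain text keeps the two levels decoupled, so either stand-in can be swapped without touching the loop.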

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Overview of the training scheme for our hierarchical large auto-bidding model. Training Stage I (left): train the LBM-Act with language-guided decision training, using a dual embedding mechanism to fuse the two modalities. Training Stage II (right): reinforcement fine-tune the LBM-Think via group relative-Q policy optimization.
  • Figure 2: Relationship between the CPA ratio and model behavior over 1000 random samples. A CPA ratio above 1 indicates that the CPA constraint is not satisfied, so the model should be highly likely to decrease the bidding parameter, i.e., $\Delta a<0$. A CPA ratio below 1 makes an increase of the bidding parameter more likely, i.e., $\Delta a>0$. This relationship is not clearly reflected in DT or the pretrained LLM, but it is more evident in the LLM fine-tuned with GQPO.
  • Figure 3: Training loss curve of LLM-DT and LBM-Act, supporting the efficiency of language-guided learning.
  • Figure 4: Illustration of the inference procedure of LBM, with a detailed example of the prompt and CoT. To act at timestep $t$, the LBM-Think first generates a CoT at timestep $t-1$, within the interval $\Delta t$. Then, when timestep $t$ arrives, the LBM-Act receives the pre-generated CoT along with the detailed numerical sequence information and generates an action, which adjusts the bidding parameter.
  • Figure 5: Visualization of attention scores across different positions and heads. For simplicity, we treat positions #0 to #6 as the language part and the remaining positions as the numerical part. The two modalities are effectively fused using different attention heads. For instance, heads #4, #5, #6, and #10 in the final layer predominantly focus on the language part, whereas other heads attend more to the numerical part. Interestingly, attention in the middle layers distinctly separates the language part (positions #0 to #6) from the numerical part, with position #7 marking the beginning of the numerical part.
  • ...and 1 more figure
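
The relationship in the Figure 2 caption can be illustrated with a short sketch. It assumes the CPA ratio is the realized cost per acquisition divided by the CPA constraint, which matches the caption's statement that a ratio above 1 means the constraint is violated. The function names and the agreement metric are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def cpa_ratio(cost: float, conversions: float, cpa_constraint: float) -> float:
    """Assumed definition: realized CPA divided by the CPA constraint."""
    if conversions <= 0:
        return float("inf")  # no conversions yet: treat the constraint as violated
    return (cost / conversions) / cpa_constraint


def expected_sign(ratio: float) -> int:
    """Intuitive direction for the bidding-parameter change (delta a)."""
    if ratio > 1.0:
        return -1  # constraint violated: lower the bidding parameter
    if ratio < 1.0:
        return 1   # headroom left: raise the bidding parameter
    return 0


def agreement_rate(ratios: Sequence[float], deltas: Sequence[float]) -> float:
    """Fraction of samples whose delta a has the intuitive sign (ratio == 1 skipped)."""
    pairs = [(r, d) for r, d in zip(ratios, deltas) if r != 1.0]
    if not pairs:
        return 0.0
    hits = sum(1 for r, d in pairs if expected_sign(r) * d > 0)
    return hits / len(pairs)
```

Under these assumptions, a model whose adjustments follow the intuition would score close to 1 on `agreement_rate` over a batch of sampled states, while a model that ignores the CPA ratio would score near chance.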