Token-level Proximal Policy Optimization for Query Generation

Yichen Ouyang, Lu Wang, Fangkai Yang, Pu Zhao, Chenghua Huang, Jianfeng Liu, Bochen Pang, Yaming Yang, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang

TL;DR

Token-level Proximal Policy Optimization (TPPO) is a novel fine-tuning approach that enables LLMs to perform better in query generation. It significantly improves LLM query generation performance and outperforms existing competitors.

Abstract

Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recent state-of-the-art query generation methods leverage Large Language Models (LLMs) for their strong capabilities in context understanding and text generation. However, they still struggle to generate high-quality queries, because doing so requires inferring user intent from the user's web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to enable LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module, which together address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO, we conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine. The experimental results demonstrate that TPPO significantly improves the performance of query generation for LLMs and outperforms its existing competitors.
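The sparse reward challenge mentioned in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names, the reward values, and the use of plain discounted returns (rather than whatever advantage estimator TPPO actually uses) are assumptions made for illustration only.

```python
from typing import List


def sentence_level_rewards(num_tokens: int, final_reward: float) -> List[float]:
    """Sparse assignment: zero reward everywhere except the last token."""
    if num_tokens <= 0:
        return []
    return [0.0] * (num_tokens - 1) + [final_reward]


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go for each position: G_t = r_t + gamma * G_{t+1}."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Sentence-level: one score for the whole response (hypothetical value).
sparse = sentence_level_rewards(num_tokens=4, final_reward=1.0)

# Token-level: one score per token (hypothetical values, e.g. from a
# token-level reward model).
dense = [1.0, -1.0, 1.0, 1.0]

print(discounted_returns(sparse))  # every token sees the same signal
print(discounted_returns(dense))   # the signal differs per token
```

With the sparse assignment every token receives the same return, so the policy cannot tell which tokens helped or hurt the query. With per-token rewards the returns differ by position, which is the kind of finer-grained credit assignment a token-level method aims for.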

Paper Structure

This paper contains 24 sections, 1 theorem, 14 equations, 13 figures, 4 tables.

Key Result

Lemma 1

The optimization problem in Equation eq:optimization yields the optimal policy as given in Equation eq_3.

Figures (13)

  • Figure 1: Reward assignment in sentence-level PPO and token-level PPO (TPPO). Sentence-level PPO assigns reward only at the end of a response, whereas TPPO assigns reward for each token in a response.
  • Figure 2: The Query Generation Task. Taking user history as input, the LLM after RLAIF alignment outputs several personalized queries that the user is interested in.
  • Figure 3: Token-level reward labeling. In phase I, we use LLaMA 3 (70B) to label word-level and sentence-level rewards for the dataset. In phase II, we map the word-level rewards to token-level rewards. The model response and user history are used to construct the input for the token-level reward model, and the mapped token rewards serve as the ground-truth output.
  • Figure 4: Objectives of the token-level reward model. Only the positions that remain after masking (the valid zone) are used to compute the loss and backpropagate gradients. The loss of the token-level reward model is the weighted sum of a local loss and a global loss.
  • Figure 5: PPO Training Curves on Open-source Dataset.
  • ...and 8 more figures
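
The captions of Figures 3 and 4 describe two mechanical steps: mapping word-level rewards to token-level rewards, and computing a loss only over the valid zone before combining a local and a global term. The sketch below illustrates both under stated assumptions. The captions do not specify the mapping rule, the form of either loss term, or how they are weighted, so the "each token inherits its word's reward" rule, the mean squared error, and the convex weighting `weight * local + (1 - weight) * global` are all hypothetical choices, as are the function names.

```python
from typing import List, Sequence


def map_word_rewards_to_tokens(
    word_rewards: Sequence[float], tokens_per_word: Sequence[int]
) -> List[float]:
    """Assumed rule: every sub-word token inherits the reward of its word."""
    if len(word_rewards) != len(tokens_per_word):
        raise ValueError("need exactly one token count per word")
    token_rewards: List[float] = []
    for reward, count in zip(word_rewards, tokens_per_word):
        token_rewards.extend([reward] * count)
    return token_rewards


def masked_mse(
    pred: Sequence[float], target: Sequence[float], mask: Sequence[int]
) -> float:
    """Mean squared error over positions where mask == 1 (the valid zone)."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0


def combined_loss(local_loss: float, global_loss: float, weight: float) -> float:
    """Assumed weighting: a convex combination of the two loss terms."""
    return weight * local_loss + (1.0 - weight) * global_loss


# Two words; the first is split into two tokens, the second into one.
targets = map_word_rewards_to_tokens([1.0, -1.0], [2, 1])

# Hypothetical predictions; the last position is masked out of the loss.
local = masked_mse(pred=[1.0, 0.0, 5.0], target=targets, mask=[1, 1, 0])
print(targets, local)
```

Masked positions contribute nothing to either the numerator or the count, so prompt or padding positions outside the valid zone do not influence the gradient, whatever the exact loss terms turn out to be.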

Theorems & Definitions (1)

  • Lemma 1