Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards

Zhuoran Zhuang, Ye Chen, Xia Zeng, Chao Luo, Luhui Liu, Yihan Chen

TL;DR

Proactive price negotiation in OTAs requires aligning long-horizon persuasion with strict SOPs and verifiable numerics, a challenge not fully met by SFT or single-signal RL. The authors propose Reward-Enhanced Policy Optimization (REPO), which jointly aggregates three reward sources, namely the preference-trained Reward Model (RM), the Reward Judge (RJ), and programmatic Reward Functions (RF), via a stability-preserving modulation, and trains efficiently with LoRA and per-trajectory rewards, guided by Generalized Advantage Estimation (GAE) and a Value Model. Across online and bad-case evaluations, REPO outperforms SFT, DPO, PPO, and GRPO on dialogue quality, incidence of excellent responses, and bad-case fix rates, with emergent capabilities such as proactive empathy and calibrated tactics. The work demonstrates that careful integration of heterogeneous signals yields practical improvements in industrial-scale, long-horizon negotiations and offers a blueprint for applying similar multi-reward approaches to other constrained, interaction-heavy tasks in real-world domains.

Abstract

We study deploying large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs), where aligning traveler affordability and hotel profitability directly affects bookings, partner relationships, and access to travel. The agent must follow a Standard Operating Procedure (SOP) while conducting multi-turn persuasion, interpreting colloquial inputs, and adhering to guardrails (no over-promising, no hallucinations). Conventional post-training -- supervised fine-tuning (SFT) or single-source reward optimization -- overfits scripts, misses nuanced persuasive style, and fails to enforce verifiable business constraints. We propose Reward-Enhanced Policy Optimization (REPO), a reinforcement learning post-training framework that aligns an LLM with heterogeneous rewards: a preference-trained reward model (RM) for dense human alignment, a reward judge (RJ) for high-level persuasive behavior and SOP compliance, and programmatic reward functions (RF) for deterministic checks on numerics, formatting, and guardrails. We propose a straightforward enhancement mechanism that combines the RM with the RJ and RF signals to curb reward hacking and improve negotiation quality. In production-style evaluations -- approximately 150 turns from real dialogues and 225 turns from curated bad-case dialogues -- REPO lifts the average dialogue rating to 4.63 (+1.20 over base, +0.83 over Direct Preference Optimization (DPO), and +0.33 over Group Relative Policy Optimization (GRPO)), increases the share of conversations with at least one excellent response to 66.67% (+23.34 percentage points over GRPO), and achieves a 93.33% bad-case fix rate with 75.56% clean fixes, outperforming SFT, DPO, PPO, and GRPO. We also observe emergent capabilities -- proactive empathy, localized reasoning, calibrated tactics -- that surpass gold annotations.
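
The abstract names three reward sources but does not give the rule that combines them. The sketch below shows one possible shape for such a mechanism and should not be read as the paper's equation. Every name in it is hypothetical: `rf_numeric_check` stands in for a programmatic reward function (RF), `combine_rewards` for the enhancement step, and `alpha`/`beta` for mixing weights. It assumes RJ and RF scores lie in [-1, 1] and that the RM score is the base signal, which the other two can only scale within a bounded range.

```python
import re
from typing import Iterable


def rf_numeric_check(response: str, allowed_values: Iterable[float]) -> float:
    """Hypothetical RF: deterministic check that the reply quotes no unseen numbers.

    Returns 1.0 when every number in the response appears in allowed_values
    (for example, prices present in the dialogue context), otherwise -1.0.
    """
    quoted = {float(x) for x in re.findall(r"\d+(?:\.\d+)?", response)}
    return 1.0 if quoted <= set(allowed_values) else -1.0


def combine_rewards(rm: float, rj: float, rf: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """Hypothetical modulation of the RM score by RJ and RF (not the paper's formula).

    With rj, rf in [-1, 1] and alpha + beta = 1, the result stays within
    [rm - |rm|, rm + |rm|], so the auxiliary signals cannot blow up the reward.
    """
    modulation = alpha * rj + beta * rf
    return rm + abs(rm) * modulation


if __name__ == "__main__":
    reply = "We can offer 320 per night instead of 350."
    rf = rf_numeric_check(reply, {320.0, 350.0})
    print(combine_rewards(rm=2.0, rj=1.0, rf=rf))
```

Under these assumptions, a reply that the RM likes but that fails a deterministic check has its reward pulled down rather than passed through unchanged, which is the kind of reward-hacking guard the abstract describes.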

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Framework of REPO for Alignment from Heterogeneous Reward Signals. Given a query ($q$), the Policy Model generates an output ($o$), which is evaluated through three reward components: the Reward Model (RM), providing a dense, human preference-aligned signal; the Reward Judge (RJ), an LLM-based evaluator scoring nuanced, high-level behaviors; and programmatic Reward Functions (RF), validating task-specific requirements with deterministic checks. These signals are combined to compute the total reward ($r$), used to guide learning via Generalized Advantage Estimation (GAE) to calculate the Advantage ($A$). The Value Model predicts the state value ($v$) to further refine the training process.
  • Figure 2: Simplified SOP for the Price Negotiation Task.
  • Figure 3: Summary of Model Performance Comparison.
  • Figure 4: Learning Curve of Negotiation Persuasion Capability.
  • Figure 5: Good-case Rate Heatmap on Fine-grained Conversational Skills.
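
The Figure 1 caption says the total reward ($r$) and the Value Model's state value ($v$) feed Generalized Advantage Estimation to produce the Advantage ($A$). A minimal sketch of the standard GAE recursion follows. The function name, the default `gamma` and `lam` values, and the choice to put a single per-trajectory reward on the final step are assumptions for illustration, not settings taken from the paper.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Standard GAE: A_t = sum_k (gamma * lam)^k * delta_{t+k}.

    rewards has T entries; values has T + 1 entries, where the last entry
    is the bootstrap value (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # Assumed setup: one trajectory-level reward r placed on the last step.
    rewards = [0.0, 0.0, 1.0]
    values = [0.0, 0.0, 0.0, 0.0]  # placeholder value-model predictions
    print(gae_advantages(rewards, values, gamma=1.0, lam=0.5))
```

With a sparse terminal reward, `lam` controls how far credit propagates back to earlier steps: values near 1 spread it across the whole response, while smaller values concentrate it near the end.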