Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning

Xuerui Su, Shufang Xie, Guoqing Liu, Yingce Xia, Renqian Luo, Peiran Jin, Zhiming Ma, Yue Wang, Zun Wang, Yuting Liu

TL;DR

The paper introduces Trust Region Preference Approximation (TRPA), a simple, stable RL framework for improving LLM reasoning that combines rule-based preference levels with preference-based policy optimization. It provides a theoretical basis via Posterior Boltzmann Approximation (PBA), including a monotonic improvement guarantee toward a target distribution, and enhances it with Kahneman-Tversky-inspired weighting (KTPO) and a KL-based trust region. Empirical results on logic reasoning and mathematical tasks show competitive performance and strong training stability, with ablations highlighting the value of KTPO and prompt-wise optimization. The work positions TRPA as a practical alternative to reward-based RLHF methods, with potential for broader AI-for-Science applications.

Abstract

Recently, Large Language Models (LLMs) have rapidly evolved, approaching Artificial General Intelligence (AGI) while benefiting from large-scale reinforcement learning to enhance Human Alignment (HA) and Reasoning. Recent reward-based optimization algorithms, such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), have achieved significant performance on reasoning tasks, whereas preference-based optimization algorithms such as Direct Preference Optimization (DPO) significantly improve the performance of LLMs on human alignment. However, despite the strong performance of reward-based optimization methods in alignment tasks, they remain vulnerable to reward hacking. Furthermore, preference-based algorithms (such as Online DPO) have not yet matched the performance of reward-based optimization algorithms (like PPO) on reasoning tasks, so exploring them in this specific area remains a worthwhile pursuit. Motivated by these challenges, we propose the Trust Region Preference Approximation (TRPA) algorithm, which integrates rule-based optimization with preference-based optimization for reasoning tasks. As a preference-based algorithm, TRPA naturally eliminates the reward hacking issue. TRPA constructs preference levels using predefined rules, forms corresponding preference pairs, and leverages a novel optimization algorithm for RL training with a theoretical monotonic improvement guarantee. Experimental results demonstrate that TRPA not only achieves competitive performance on reasoning tasks but also exhibits robust stability. The code for this paper is released and being updated at https://github.com/XueruiSu/Trust-Region-Preference-Approximation.git.
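
The abstract's description of building preference levels from predefined rules and then forming preference pairs can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper or its repository: the three-level rule set (0 = no parseable answer, 1 = parseable but wrong, 2 = correct), the `<answer>` tag format, and the function names `preference_level` and `build_preference_pairs`.

```python
import re
from itertools import combinations


def preference_level(response: str, gold_answer: str) -> int:
    """Assign a rule-based preference level (hypothetical rule set).

    0: no <answer>...</answer> block can be parsed
    1: an answer is present but does not match the gold answer
    2: the answer matches the gold answer
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0
    return 2 if match.group(1).strip() == gold_answer.strip() else 1


def build_preference_pairs(responses, gold_answer):
    """Form (chosen, rejected) pairs from responses to one prompt.

    A pair is emitted only when the two responses fall on different
    levels; responses on the same level produce no pair.
    """
    levelled = [(preference_level(r, gold_answer), r) for r in responses]
    pairs = []
    for (level_a, resp_a), (level_b, resp_b) in combinations(levelled, 2):
        if level_a > level_b:
            pairs.append((resp_a, resp_b))
        elif level_b > level_a:
            pairs.append((resp_b, resp_a))
    return pairs


if __name__ == "__main__":
    sampled = ["<answer>4</answer>", "<answer>5</answer>", "no tags here"]
    for chosen, rejected in build_preference_pairs(sampled, "4"):
        print(f"chosen={chosen!r}  rejected={rejected!r}")
```

In this sketch, pairs are built only among responses to the same prompt, and the resulting pairs would then be passed to whatever preference-based objective is being trained; the objective itself is not shown.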

Paper Structure

This paper contains 20 sections, 5 theorems, 36 equations, 4 figures, 2 tables.

Key Result

Lemma 4.2

The Online DPO algorithm is not a PBA algorithm.

Figures (4)

  • Figure 1: (a) Trust Region Preference Approximation algorithm. (b) Reward-based Optimization algorithms with two reward scenarios, where Rule-based Optimization algorithms construct the reward function based on predefined rules (e.g., DeepSeek R1 [DeepSeekR1]), and Learnable Reward-based Optimization algorithms train a learnable reward model for subsequent RL training stages (e.g., InstructGPT [ouyang2022training]). (c) Preference-based Optimization algorithms.
  • Figure 2: (a) Contour map of the cross-entropy loss function. (b) Contour map of the KL loss function. (c) Comparison of the KL loss and cross-entropy loss when $p=q$.
  • Figure 3: Comparison of TRPA (Blue), TRPA w/o KTPO (Purple), Online DPO (Green), and GRPO (Pink) in terms of (a) Accuracy, (b) Response Length, and (c) Entropy (Response Length and Entropy are averaged over a sliding window of 400).
  • Figure 4: Experiment on the K&K logic puzzle dataset with 3 people. (a) Accuracy and Response Length, (b) Entropy, and (c) Logit Ratio $\log(\pi_\theta/\pi_{\text{ref}})$.
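
When reading Figure 2(c), it may help to recall the standard identity relating the two losses. This is a general fact about discrete distributions, not a derivation taken from the paper; here $p$ is assumed to denote the target distribution, $q$ the model distribution, and $H$ the Shannon entropy:

$$
\mathrm{CE}(p, q) = -\sum_x p(x)\log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)
$$

At $p=q$ the KL term vanishes, so $D_{\mathrm{KL}}(p \,\|\, q) = 0$ while $\mathrm{CE}(p, q) = H(p)$, which is generally nonzero. Under these assumptions, the KL loss is exactly zero at the optimum, whereas the cross-entropy loss only falls to the entropy of the target.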

Theorems & Definitions (6)

  • Definition 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Theorem 4.4
  • Lemma B.1
  • Lemma B.2