Partial Policy Gradients for RL in LLMs

Puneet Mathur; Branislav Kveton; Subhojyoti Mukherjee; Viet Dac Lai

TL;DR

This work models policy structure in policy gradients by optimizing for a subset of future rewards. Smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate.

Abstract

Reinforcement learning is a framework for learning to act sequentially in an unknown environment. We propose a natural approach for modeling policy structure in policy gradients. The key idea is to optimize for a subset of future rewards: smaller subsets represent simpler policies, which can be learned more reliably because their empirical gradient estimates are more accurate. Our approach allows for modeling and comparison of different policy classes, including full planning, greedy, K-step lookahead, and segment policies. We evaluate the policies empirically on multiple persona-alignment conversational problems. Different policies excel in different problems, reflecting their different characteristics and highlighting the importance of our studied policy class.
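The key idea above can be illustrated with a minimal sketch of the per-step return that each policy class would optimize. The function name `partial_returns`, the example rewards, and the window convention (step $t$ sums rewards $t, \dots, t + K - 1$, so that $K = 1$ is greedy, matching the figure captions) are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional


def partial_returns(rewards: List[float], k: Optional[int] = None) -> List[float]:
    """Sum a subset of future rewards for every step of one trajectory.

    k=None   -> full policy gradient: all rewards from step t onward.
    k=1      -> greedy: only the immediate reward.
    k=K >= 2 -> K-step lookahead: rewards t, ..., t+K-1 (assumed convention).
    """
    if k is not None and k < 1:
        raise ValueError("k must be a positive integer or None")
    n = len(rewards)
    returns = []
    for t in range(n):
        # The window is clipped at the end of the trajectory.
        end = n if k is None else min(n, t + k)
        returns.append(sum(rewards[t:end]))
    return returns


# Hypothetical per-step rewards for a 4-step trajectory.
rewards = [1.0, 0.0, 2.0, 3.0]
full = partial_returns(rewards)            # full planning
greedy = partial_returns(rewards, k=1)     # greedy
two_step = partial_returns(rewards, k=2)   # 2-step lookahead
```

In a policy gradient estimator, each of these values would weight the log-probability gradient of the action taken at the corresponding step. A smaller window sums fewer random rewards per step, which is the sense in which the abstract describes the gradient estimates of simpler policies as more accurate.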

Paper Structure (57 sections, 5 theorems, 31 equations, 7 figures, 3 tables, 1 algorithm)

Introduction
Setting
Algorithms
Policy Gradient
Partial Policy Gradient
Interpretation
Offline Partial Policy Gradient
Partial Policy Gradient Instances
Full Policy Gradient
Greedy Policy Gradient
$K$-Step Lookahead Policy Gradient
Experiments
Data Settings
Evaluation
Baselines and Compared Methods
...and 42 more sections

Key Result

Lemma 1

Let $f(x, \tau_t)$ be any function of $x$ and $\tau_t$ such that $0 = f(x, \tau_0) \leq \dots \leq f(x, \tau_n) = r(x, \tau_n)$. Let $r_t = f(x, \tau_t) - f(x, \tau_{t - 1})$ hold for all $t \in [n]$. Then $\sum_{t = 1}^n r_t = r(x, \tau_n)$ and $r_t \geq 0$ for all $t \in [n]$.
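A short derivation shows why the lemma's conclusion follows. It assumes the per-step reward is the increment $r_t = f(x, \tau_t) - f(x, \tau_{t - 1})$, which is the definition consistent with the stated conclusion. This is a sketch, not the paper's proof.

```latex
\begin{align*}
\sum_{t = 1}^n r_t
  &= \sum_{t = 1}^n \bigl( f(x, \tau_t) - f(x, \tau_{t - 1}) \bigr)
     && \text{(assumed definition of } r_t \text{)} \\
  &= f(x, \tau_n) - f(x, \tau_0)
     && \text{(telescoping sum)} \\
  &= r(x, \tau_n) - 0 = r(x, \tau_n),
     && \text{(boundary conditions on } f \text{)} \\
r_t
  &= f(x, \tau_t) - f(x, \tau_{t - 1}) \geq 0
     && \text{(since } f(x, \tau_{t - 1}) \leq f(x, \tau_t) \text{)}.
\end{align*}
```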

Figures (7)

Figure 1: Examples of reward indices $\mathcal{R}_t$ (yellow circles in columns) and action indices $\mathcal{S}_t$ (blue circles in rows) in full, greedy, and $2$-step lookahead policies.
Figure 2: Residual of persona consistency scaled to $[-1, 1]$ across trajectory steps for all domains with Qwen. $K = 1$ represents $\color{Green}\tt GreedyPG$ and $K \in \{2, 3, 4, 5\}$ represents $\color{Green}\tt K\text{-}Step\text{-}PG$.
Figure 3: Persona consistency as a function of sample size (number of training trajectories) for all domains with Llama. $K = 1$ represents $\color{Green}\tt GreedyPG$ and $K \in \{2, 3, 4, 5\}$ represents $\color{Green}\tt K\text{-}Step\text{-}PG$.
Figure 4: Scaling laws for $\color{Green}\tt PPG$: across different LLMs (Llama-3.1-8B-Instruct, Qwen3-8B, and Gemma-7B-it), the optimal value of the lookahead $K$ in $\color{Green}\tt K\text{-}Step\text{-}PG$ scales with the number of available training trajectories.
Figure 5: Persona consistency (PC) of policy gradient methods versus trajectory length in the education, therapy, and chatting domains for the Llama-3.1-8B-Instruct model. For each method, we report the mean PC of all trajectories with $t$ steps, where $t \in \{10, 20, 40, 60\}$.
...and 2 more figures

Theorems & Definitions (6)

Lemma 1
Lemma 2
Lemma 3
Theorem 4
Theorem 5
proof
