Residual Policy Gradient: A Reward View of KL-regularized Objective

Pengcheng Wang, Xinghao Zhu, Yuxin Chen, Chenfeng Xu, Masayoshi Tomizuka, Chenran Li

TL;DR

This work introduces Residual Policy Gradient (RPG), a gradient-based extension of Residual Q-Learning (RQL) for policy customization, and derives a concise Soft Policy Gradient as its foundation. By recasting KL-regularized objectives as reward-level trade-offs and proposing a decoupled reward form, RPG provides flexible tuning via parameters like $\omega'$ and $\hat{\alpha}$, rather than relying solely on KL constraints. The authors present Soft PPO and Residual PPO as practical instantiations that incorporate the augmented reward structure into actor-critic updates, demonstrating improved add-on-task performance and balanced trade-offs in MuJoCo locomotion tasks. The results offer theoretical and empirical support for viewing RLHF-like fine-tuning and policy customization through a reward-centric lens, with implications for more adaptable and stable deployment of learned policies.

Abstract

Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment. One main obstacle is that deployment often imposes additional requirements that were not considered during training. To address this challenge, policy customization has been introduced, which aims to adapt a prior policy so that it preserves its inherent properties while meeting new task-specific requirements. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been applied to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient methods have already proven more effective. In this work, we first derive a concise form of Soft Policy Gradient as a preliminary. Building on this, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods, allowing policy customization in gradient-based RL settings. Through the lens of RPG, we revisit the KL-regularized objective widely used in RL fine-tuning. We show that, under certain assumptions, the KL-regularized objective leads to a maximum-entropy policy that balances the inherent properties and task-specific requirements at the reward level. Our experiments in MuJoCo demonstrate the effectiveness of Soft Policy Gradient and Residual Policy Gradient.
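
The reward-level reading of the KL-regularized objective can be illustrated in the simplest possible case. The sketch below is not the paper's derivation: it assumes a single-step problem with a handful of discrete actions, and the names `reward`, `prior`, and `beta` are hypothetical values chosen for illustration. Under those assumptions, maximizing expected reward minus `beta` times the KL divergence to the prior has the well-known closed-form solution proportional to `prior * exp(reward / beta)`, and this coincides with the maximum-entropy solution for the augmented reward `reward + beta * log(prior)`.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl_regularized_optimum(reward, prior, beta):
    """Maximizer of E_pi[reward] - beta * KL(pi || prior) over discrete actions.

    Closed form: pi(a) is proportional to prior(a) * exp(reward(a) / beta).
    """
    return softmax(np.log(prior) + reward / beta)


def max_entropy_optimum(reward, temperature):
    """Maximizer of E_pi[reward] + temperature * H(pi) over discrete actions.

    Closed form: pi(a) is proportional to exp(reward(a) / temperature).
    """
    return softmax(reward / temperature)


# Hypothetical toy values (assumptions for illustration, not from the paper).
reward = np.array([1.0, 0.5, -0.2, 0.0])   # add-on task reward per action
prior = np.array([0.1, 0.6, 0.2, 0.1])     # prior policy probabilities
beta = 0.5                                  # KL regularization weight

# View 1: KL-regularized fine-tuning against the prior policy.
pi_kl = kl_regularized_optimum(reward, prior, beta)

# View 2: maximum-entropy RL on a reward that adds beta * log(prior).
augmented_reward = reward + beta * np.log(prior)
pi_maxent = max_entropy_optimum(augmented_reward, beta)

# Both views yield the same policy in this single-step setting.
same_policy = np.allclose(pi_kl, pi_maxent)
```

In this toy setting the log-probability of the prior policy acts as an extra reward term, which is the sense in which the regularizer can be traded off against the task reward at the reward level. The sequential (MDP) case treated in the paper requires additional assumptions that this sketch does not capture.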

Paper Structure

This paper contains 21 sections, 33 equations, and 5 tables.