Non-Asymptotic Global Convergence of PPO-Clip
Yin Liu, Qiming Dai, Junyu Zhang, Zaiwen Wen
TL;DR
The paper analyzes a deterministic actor-only PPO-Clip algorithm with general f-divergence regularization under the softmax policy parameterization in infinite-horizon MDPs. It proves a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the f-divergence-regularized objective, which together enable non-asymptotic convergence results. Specifically, it establishes a non-asymptotic linear convergence rate to the globally optimal policy for forward KL regularization, and convergence to a stationary point together with local linear convergence for reverse KL regularization. The results provide rigorous convergence guarantees for PPO-Clip variants relevant to RLHF and LLM alignment, and clarify the initialization and stepsize requirements for achieving global optimality or fast local convergence.
Abstract
Reinforcement learning (RL) has gained attention for aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF). The actor-only variants of Proximal Policy Optimization (PPO) are widely applied for their efficiency. These algorithms incorporate a clipping mechanism to improve stability. In addition, a regularization term, such as the reverse KL-divergence or a more general \(f\)-divergence, is introduced to prevent policy drift. Despite their empirical success, a rigorous theoretical understanding of the problem and of the algorithm's properties remains limited. This paper advances the theoretical foundations of the PPO-Clip algorithm by analyzing a deterministic actor-only PPO algorithm within the general RL setting with \(f\)-divergence regularization under the softmax policy parameterization. We derive a non-uniform Lipschitz smoothness condition and a Łojasiewicz inequality for the considered problem. Based on these, a non-asymptotic linear convergence rate to the globally optimal policy is established for the forward KL-regularizer. Furthermore, stationary convergence and local linear convergence are derived for the reverse KL-regularizer.
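
To make the ingredients named in the abstract concrete, the following is a minimal sketch of a clipped surrogate with a KL penalty under a softmax policy. It is not the paper's exact objective: it assumes a single state with a finite action set rather than an infinite-horizon MDP, it takes the advantages as given, and it uses the common convention that reverse KL is \(\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})\) and forward KL is \(\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi_\theta)\). The function names and the parameters `eps` and `beta` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Softmax policy over a finite action set (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_surrogate(pi_new, pi_old, advantages, eps=0.2):
    """Exact (deterministic) clipped surrogate for one state, averaged under pi_old."""
    total = 0.0
    for p_new, p_old, adv in zip(pi_new, pi_old, advantages):
        ratio = p_new / p_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Standard PPO-Clip form: the pessimistic minimum of the two terms.
        total += p_old * min(ratio * adv, clipped * adv)
    return total


def kl(p, q):
    """KL(p || q) for two distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_objective(theta, theta_old, theta_ref, advantages,
                          eps=0.2, beta=0.1, direction="reverse"):
    """Clipped surrogate minus a KL penalty (illustrative assumption, not the paper's form)."""
    pi_new = softmax(theta)
    pi_old = softmax(theta_old)
    pi_ref = softmax(theta_ref)
    surrogate = clipped_surrogate(pi_new, pi_old, advantages, eps)
    if direction == "reverse":
        penalty = kl(pi_new, pi_ref)   # assumed convention: KL(pi_theta || pi_ref)
    else:
        penalty = kl(pi_ref, pi_new)   # assumed convention: KL(pi_ref || pi_theta)
    return surrogate - beta * penalty


if __name__ == "__main__":
    value = regularized_objective(
        theta=[0.5, 0.0, -0.5],
        theta_old=[0.0, 0.0, 0.0],
        theta_ref=[0.0, 0.0, 0.0],
        advantages=[1.0, 0.0, -1.0],
    )
    print(value)
```

In this sketch the clipping term caps how much a single update can gain from moving the probability ratio outside \([1-\epsilon, 1+\epsilon]\), while the KL term penalizes drift from the reference policy; the two directions of KL differ only in which distribution the expectation is taken under.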
