Lipschitz-Regularized Critics Lead to Policy Robustness Against Transition Dynamics Uncertainty

Xulin Chen; Ruipeng Liu; Zhenyu Gan; Garrett E. Katz

TL;DR

This work addresses the sim-to-real robustness gap in reinforcement learning by introducing PPO-PGDLC, which combines worst-case value estimation based on projected gradient descent with a Lipschitz-regularized critic within the PPO framework. The method aims to produce policies that maintain performance under uncertain transition dynamics while delivering smoother actions for reliable hardware deployment. In experiments on classic control benchmarks and a Unitree Go2 locomotion task, PPO-PGDLC demonstrates improved robustness to dynamics perturbations and smoother control, showing the effectiveness of critic smoothness in robust policy learning. The results indicate practical benefits for sim-to-real transfer and offer insights into task-specific regularization trade-offs for robust actor-critic learning.

Abstract

Uncertainties in transition dynamics pose a critical challenge in reinforcement learning (RL), often resulting in performance degradation of trained policies when deployed on hardware. Many robust RL approaches follow one of two strategies: enforcing smoothness in actor or actor-critic modules with Lipschitz regularization, or learning robust Bellman operators. However, work following the first strategy has not investigated the impact of critic-only Lipschitz regularization on policy robustness, while the second lacks comprehensive validation in real-world scenarios. To address this gap and building on prior work, we propose PPO-PGDLC, an algorithm based on Proximal Policy Optimization (PPO) that integrates Projected Gradient Descent (PGD) with a Lipschitz-regularized critic (LC). The PGD component calculates the adversarial state within an uncertainty set to approximate the robust Bellman operator, and the Lipschitz-regularized critic further improves the smoothness of learned policies. Experimental results on two classic control tasks and one real-world robotic locomotion task demonstrate that, compared to several baseline algorithms, PPO-PGDLC achieves better performance and predicts smoother actions under environmental perturbations.
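The two ingredients named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy critic `tanh(w . s)`, the L-infinity shape of the uncertainty set, the signed-gradient step, and the squared-gradient-norm penalty are all assumptions chosen to keep the example self-contained, and every function name is hypothetical. The radius `eps = 0.003` is borrowed from the value reported for the classic control experiments.

```python
import math

def critic(s, w):
    """Toy critic V(s) = tanh(w . s); an assumed stand-in for a neural network."""
    return math.tanh(sum(wi * si for wi, si in zip(w, s)))

def critic_grad(s, w):
    """Analytic gradient of the toy critic with respect to the state."""
    scale = 1.0 - critic(s, w) ** 2
    return [scale * wi for wi in w]

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def pgd_worst_case_state(s, w, eps, step_size, n_steps):
    """Approximately minimize V over an assumed L-infinity ball of radius eps around s."""
    s_adv = list(s)
    for _ in range(n_steps):
        g = critic_grad(s_adv, w)
        # Signed gradient descent step: move toward lower value.
        s_adv = [x - step_size * sign(gi) for x, gi in zip(s_adv, g)]
        # Projection: clip each coordinate back into [s - eps, s + eps].
        s_adv = [min(max(x, s0 - eps), s0 + eps) for x, s0 in zip(s_adv, s)]
    return s_adv

def lipschitz_penalty(states, w, lam):
    """Assumed regularizer: lam times the mean squared gradient norm of the critic."""
    sq_norms = [sum(gi * gi for gi in critic_grad(s, w)) for s in states]
    return lam * sum(sq_norms) / len(sq_norms)

if __name__ == "__main__":
    s = [0.5, -0.2, 0.1]   # nominal state (illustrative values)
    w = [1.0, -2.0, 0.5]   # toy critic weights (illustrative values)
    s_adv = pgd_worst_case_state(s, w, eps=0.003, step_size=0.001, n_steps=10)
    print("nominal value:   ", critic(s, w))
    print("worst-case value:", critic(s_adv, w))
    print("penalty:         ", lipschitz_penalty([s, s_adv], w, lam=0.1))
```

In a PPO-style setup, the worst-case value would presumably replace the nominal value in the critic target, and the penalty would be added to the critic loss; how the paper combines them is described in its methodology section rather than here.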

Paper Structure (26 sections, 13 equations, 4 figures, 4 tables, 2 algorithms)

INTRODUCTION
RELATED WORK
Robust Reinforcement Learning
Lipschitz Continuity and Regularization
PRELIMINARIES
Lipschitz Continuity and Regularization
Robust Markov Decision Process
METHODOLOGY
Problem Formulation: Worst-Case Value Estimation
PGD and Critic Lipschitz Regularization
Practical Implementation
EXPERIMENTS
Metric for Quantifying Policy Robustness
Experiments on Classic Control Tasks
Baseline Algorithms
...and 11 more sections

Figures (4)

Figure 1: Overview of the PPO-PGDLC framework. PGD estimates the worst-case value within a bounded uncertainty set, while the LC enforces local smoothness in the critic to stabilize updates. Their integration within PPO enhances policy robustness to transition dynamics uncertainty and improves sim-to-real reliability.
Figure 2: Training curves (left) and heatmaps (right) for Cartpole (top row) and Ant (bottom row), where $\epsilon=0.003$ for PPO-GBR, PPO-PGD and PPO-PGDLC. For the heatmaps, we collect a total of 80 episodes with 4 trained policies and average their episode rewards for each environment. Redder colors indicate higher reward. Overall, PPO-PGDLC maintains higher returns and greater robustness across mass–friction perturbations compared with all baselines, highlighting the benefit of integrating PGD-based value estimation with a Lipschitz-regularized critic.
Figure 3: Radar charts visualizing AS, SFR and VTE across different command velocities. Each vertex represents a specific command velocity, with the distance from the center indicating the metric value. For all metrics, smaller values are better. PPO-PGDLC with $\lambda = 5\times10^{-5}$ achieves the lowest AS, SFR, and VTE across both trotting and bounding gaits, demonstrating smoother control and more accurate velocity tracking compared with the baseline policies.
Figure 4: PPO-PGDLC policy on Go2 hardware, with a 2 kg (top) and 4 kg (bottom) payload and the command $v_x^\text{cmd}=1\,\text{m/s}$.
