Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking

Dylan Khor, Bowen Weng

TL;DR

This work addresses the post-convergence sim-to-real transfer problem, in which choosing a policy for deployment after RL training has converged remains heuristic. It introduces a principled worst-case performance predictor: construct a KL-divergence-bounded neighborhood around the simulator distribution $q$, then solve a convex quadratic-constrained linear program to obtain the worst-case distribution $\rho$ within that neighborhood. Theoretical results show that this worst-case estimate reduces variance and improves ranking robustness relative to simulator-based estimates, making it a more reliable indicator of real-world performance under the unknown real-world distribution $p$. Empirical validation with Unitree G1 locomotion shows that, under both undisturbed and disturbed conditions, the worst-case predictor aligns policy rankings with real-world outcomes better than direct simulation-based or adversarial indicators, in some cases reaching a Spearman's correlation coefficient (SCC) of about 0.9. Overall, the approach offers a principled, data-efficient mechanism for post-convergence policy selection, with practical implications for real-world robotics deployments.

Abstract

Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. While some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. However, selecting a policy for real-world deployment is rarely an analytical decision (i.e., simply choosing the one with the highest reward) and is instead often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. However, these methods do not eliminate the inevitable convergence trajectory and noisy oscillations of rewards, leading to heuristic policy selection or cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratic-constrained linear programming problem. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.
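
To make the idea of a worst-case estimate inside a KL-divergence-bounded neighborhood concrete, the following is a minimal sketch under stated assumptions, not the paper's quadratic-constrained linear program. It swaps in the standard exponential-tilting solution for minimizing $\mathbb{E}_{\rho}[\psi]$ subject to $\mathrm{KL}(\rho \,\|\, q) \le k$ over a discrete distribution. The function names, the direction of the KL divergence, the convention that larger $\psi$ means better performance, and the toy numbers are all assumptions made for illustration.

```python
import math

def kl_divergence(rho, q):
    """KL(rho || q) for discrete distributions given as lists."""
    return sum(r * math.log(r / qi) for r, qi in zip(rho, q) if r > 0)

def tilt(q, psi, lam):
    """Exponentially tilted distribution: rho_i proportional to q_i * exp(-psi_i / lam)."""
    m = min(psi)  # shift by the minimum for numerical stability
    w = [qi * math.exp(-(s - m) / lam) for qi, s in zip(q, psi)]
    z = sum(w)
    return [wi / z for wi in w]

def worst_case(q, psi, k, iters=200):
    """Return (rho, E_rho[psi]) minimizing E_rho[psi] subject to KL(rho || q) <= k.

    Assumes k > 0. The KL divergence of the tilted distribution decreases
    as lam grows, so a bisection on lam (in log scale) finds the tilt
    whose divergence just meets the bound.
    """
    lo, hi = 1e-6, 1e6
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if kl_divergence(tilt(q, psi, mid), q) > k:
            lo = mid  # tilt too aggressive: move toward q
        else:
            hi = mid  # feasible: try a stronger tilt
    rho = tilt(q, psi, hi)
    return rho, sum(r * s for r, s in zip(rho, psi))

# Toy example: four equally likely simulated outcomes with performance psi.
q = [0.25, 0.25, 0.25, 0.25]
psi = [1.0, 2.0, 3.0, 4.0]
rho, value = worst_case(q, psi, k=0.1)
print(rho, value)
```

In this sketch the simulator estimate is the plain average of `psi` under `q`, while the worst-case estimate shifts probability mass toward low-performance outcomes until the divergence budget `k` is spent. A larger `k` gives a more pessimistic estimate.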

Paper Structure

This paper contains 14 sections, 1 theorem, 8 equations, 7 figures.

Key Result

Theorem 1

Let $p_1$ and $p_2$ be two unknown probability distributions satisfying $\mathbb{E}_{p_1}[\psi] > \mathbb{E}_{p_2}[\psi]$ for some performance measure function $\psi$. Let $q_1$ and $q_2$ be approximations of $p_1$ and $p_2$, respectively, with unknown discrepancies. Let $\rho_1$ and $\rho_2$ be the

Figures (7)

  • Figure 1: The average training reward when training a locomotion policy with RL for the Unitree G1 robot in the Isaac Gym simulator \cite{isaacgym}. The program is a re-execution of \cite{unitreegithub} with two different random seeds (1 and 50). The highlighted policies (red and green dashed segments) are analyzed further in Section \ref{sec:exp}.
  • Figure 2: An overview of the main steps in the proposed post-convergence sim-to-real performance indication.
  • Figure 3: Visual representations of the undisturbed and disturbed testing setups in simulation and in the real world. For the undisturbed and disturbed tests, respectively, 13 policies (red segments in Fig. \ref{fig:train_logs}) and 5 policies (green segments in Fig. \ref{fig:train_logs}) are selected.
  • Figure 4: Comparison of predicted policy rankings from various sim-to-real performance indicators against real-world evaluated rankings in the undisturbed testing group, measured using Spearman’s correlation coefficient (SCC) across two different RHW levels (confidence level $0.05$). The proposed worst-case estimates exhibit some variation depending on the specified KL-divergence bound $k$. In contrast, other indicators, such as direct simulation-based estimates and adversarial simulation evaluations, remain constant and appear as flat lines.
  • Figure 5: Comparison of predicted policy rankings from various sim-to-real performance indicators against real-world evaluated rankings in the disturbed testing group using the stability reward function $r_2$, measured using Spearman’s correlation coefficient (SCC) at the RHW level of $0.03$ (with confidence level $0.05$). Panels (a) and (b) use the disturbed and undisturbed simulation testing environments, respectively, together with their corresponding worst-case estimates, to indicate real-world performance under disturbances.
  • ...and 2 more figures
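
The figure captions above compare indicators by Spearman's correlation coefficient (SCC) between a predicted policy ranking and the real-world ranking. As a minimal sketch of that metric, the following computes SCC as the Pearson correlation of average ranks. The function names and the example scores are hypothetical and are not taken from the paper's data.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for five policies: an indicator's prediction
# versus the measured real-world performance.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
real_world = [2.0, 1.0, 4.0, 3.0, 5.0]
print(spearman(predicted, real_world))
```

An SCC of 1 means the indicator orders the policies exactly as the real-world tests do, and values near 0 mean the ordering carries little information. For significance testing, `scipy.stats.spearmanr` returns the same statistic together with a p-value.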

Theorems & Definitions (4)

  • Remark 1
  • Theorem 1
  • proof
  • Remark 2