Post-Convergence Sim-to-Real Policy Transfer: A Principled Alternative to Cherry-Picking
Dylan Khor, Bowen Weng
TL;DR
This work addresses the post-convergence sim-to-real transfer problem: after RL training converges, the choice of which policy to deploy remains heuristic. It introduces a principled worst-case performance predictor by constructing a KL-divergence–bounded neighborhood around the simulator distribution $q$ and solving a convex quadratically constrained linear program to obtain the worst-case distribution $\rho$. Theoretical results show that this worst-case estimate reduces variance and improves ranking robustness relative to simulator-based estimates, making it a more reliable indicator of performance under the real-world distribution $p$. Empirical validation on Unitree G1 locomotion shows that, under both undisturbed and disturbed conditions, the worst-case predictor aligns policy rankings with real-world outcomes better than traditional simulation or adversarial indicators, in some cases reaching strong correlations (SCC up to about 0.9). Overall, the approach offers a principled, data-efficient mechanism for post-convergence policy selection, with practical implications for real-world robot deployment.
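To make the idea of a worst-case estimate inside a KL-bounded neighborhood concrete, here is a minimal sketch for a discrete set of simulated outcomes. It is not the authors' quadratically constrained formulation: it uses the standard exponential-tilting form of the KL-constrained worst case, $\rho_i \propto q_i \exp(-r_i/\lambda)$, and finds $\lambda$ by bisection. The names `rewards`, `delta`, `lam`, and all function names are assumptions introduced for illustration, and `q` is assumed strictly positive.

```python
import numpy as np

def kl_divergence(rho, q):
    """KL(rho || q) for discrete distributions; terms with rho_i = 0 contribute 0."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / q[mask])))

def tilted_distribution(q, rewards, lam):
    """Exponentially tilt q toward low-reward outcomes: rho_i ~ q_i * exp(-r_i / lam)."""
    # Shift by the minimum reward so every exponent is <= 0 (numerical stability).
    weights = q * np.exp(-(rewards - rewards.min()) / lam)
    return weights / weights.sum()

def worst_case_reward(q, rewards, delta, iters=200):
    """Approximate min over rho of E_rho[reward] subject to KL(rho || q) <= delta.

    Hypothetical helper for illustration; assumes q > 0 everywhere.
    Returns (worst-case expected reward, worst-case distribution rho).
    """
    q = np.asarray(q, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    if delta <= 0 or np.ptp(rewards) == 0:
        return float(q @ rewards), q  # empty neighborhood or constant reward
    # Smaller lam is more adversarial and gives a larger KL, so bisect on lam.
    lo, hi = 1e-8, 1e8
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint: lam spans many orders of magnitude
        if kl_divergence(tilted_distribution(q, rewards, mid), q) > delta:
            lo = mid  # too adversarial: outside the KL ball
        else:
            hi = mid  # feasible: try a more adversarial tilt
    rho = tilted_distribution(q, rewards, hi)  # hi always stays feasible
    return float(rho @ rewards), rho

# Usage: four equally likely simulated outcomes with rewards 1..4.
q = np.full(4, 0.25)
rewards = np.array([1.0, 2.0, 3.0, 4.0])
wc, rho = worst_case_reward(q, rewards, delta=0.1)
print("simulator estimate:", float(q @ rewards))
print("worst-case estimate:", wc)
```

Under this sketch, candidate post-convergence policies would be ranked by their worst-case estimate rather than by the simulator mean; the radius `delta` controls how far the real-world distribution is allowed to deviate from the simulator.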
Abstract
Learning-based approaches, particularly reinforcement learning (RL), have become widely used for developing control policies for autonomous agents, such as locomotion policies for legged robots. RL training typically maximizes a predefined reward (or minimizes a corresponding cost/loss) by iteratively optimizing policies within a simulator. Starting from a randomly initialized policy, the empirical expected reward follows a trajectory with an overall increasing trend. While some policies become temporarily stuck in local optima, a well-defined training process generally converges to a reward level with noisy oscillations. Selecting a policy for real-world deployment, however, is rarely an analytical decision (i.e., simply choosing the policy with the highest reward) and is instead often performed through trial and error. To improve sim-to-real transfer, most research focuses on the pre-convergence stage, employing techniques such as domain randomization, multi-fidelity training, adversarial training, and architectural innovations. These methods do not eliminate the convergence trajectory or the noisy oscillations of the reward, so policy selection remains heuristic and amounts to cherry-picking. This paper addresses the post-convergence sim-to-real transfer problem by introducing a worst-case performance transference optimization approach, formulated as a convex quadratically constrained linear program. Extensive experiments demonstrate its effectiveness in transferring RL-based locomotion policies from simulation to real-world laboratory tests.
