SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

Yarden As, Chengrui Qu, Benjamin Unger, Dongho Kang, Max van der Hart, Laixi Shi, Stelian Coros, Adam Wierman, Andreas Krause

TL;DR

The paper tackles zero-shot safe transfer from simulation to real robots by addressing the sim-to-real safety gap. It introduces SPiDR, a pessimistic domain-randomization approach that augments the constrained Markov decision process (CMDP) framework with a penalized cost. The penalty bounds the real-world cost under model mismatch, measured by the $L_1$-Wasserstein distance $D_W(\hat p_\xi, p^\star)(s,a)$, and is approximated in practice by ensemble disagreement. The key theoretical result shows that solving the penalized CMDP yields a policy satisfying the real-world safety constraint $C_{p^\star}(\pi) \le d$. SPiDR remains compatible with standard RL pipelines and scales to sim-to-sim and real-world tasks, including vision-based control. Empirically, SPiDR demonstrates safe zero-shot transfer on two real robotic platforms (Race Car and Unitree Go1) and strong performance across sim-to-sim benchmarks, with ablations highlighting robustness to the penalty parameter and ensemble size.
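To make the penalized cost concrete, the following is a minimal sketch, not the authors' implementation. It assumes that disagreement is measured as the mean Euclidean distance of each ensemble member's predicted next state from the ensemble mean, and it uses a hypothetical 1-D point mass whose friction coefficient is the randomized parameter. The names `ensemble_disagreement`, `penalized_cost`, `make_dynamics`, and `penalty_weight` are illustrative only.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # (position, velocity) of a hypothetical point mass


def ensemble_disagreement(next_states: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance of each predicted next state from the ensemble mean.

    This is one possible disagreement measure, chosen here for illustration.
    """
    n = len(next_states)
    dim = len(next_states[0])
    mean = [sum(s[i] for s in next_states) / n for i in range(dim)]
    return sum(math.dist(s, mean) for s in next_states) / n


def penalized_cost(cost: float,
                   next_states: Sequence[Sequence[float]],
                   penalty_weight: float) -> float:
    """Simulator cost plus a pessimism penalty proportional to the disagreement."""
    return cost + penalty_weight * ensemble_disagreement(next_states)


def make_dynamics(friction: float) -> Callable[[State, float], State]:
    """Hypothetical simulator instance; `friction` is the randomized parameter."""
    def step(state: State, action: float, dt: float = 0.1) -> State:
        pos, vel = state
        new_vel = vel + dt * (action - friction * vel)
        return (pos + dt * new_vel, new_vel)
    return step


# An ensemble of simulators drawn from the randomization range (assumed values).
ensemble: List[Callable[[State, float], State]] = [
    make_dynamics(f) for f in (0.1, 0.5, 1.0)
]

at_rest = [step((0.0, 0.0), 1.0) for step in ensemble]   # friction has no effect
at_speed = [step((0.0, 2.0), 1.0) for step in ensemble]  # friction matters

print(penalized_cost(0.0, at_rest, penalty_weight=5.0))
print(penalized_cost(0.0, at_speed, penalty_weight=5.0))
```

In this toy setting the simulators agree when the mass is at rest, so no penalty is added, while at nonzero velocity the friction uncertainty produces disagreement and the cost is inflated. A constrained RL algorithm trained on the inflated cost is thereby pushed away from transitions where the simulators disagree.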

Abstract

Deploying reinforcement learning (RL) safely in the real world is challenging, as policies trained in simulators must face the inevitable sim-to-real gap. Robust safe RL techniques are provably safe but difficult to scale, while domain randomization is more practical yet prone to unsafe behaviors. We address this gap by proposing SPiDR, short for Sim-to-real via Pessimistic Domain Randomization, a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.

Paper Structure

This paper contains 75 sections, 7 theorems, 42 equations, 23 figures, 2 tables.

Key Result

Lemma 4.1

Let $\mathbb{P}_{p,\pi,t}(s)$ denote the probability of reaching the state $s$ at step $t$ under the policy $\pi$ and the dynamics $p$, and let $d_{p,\pi}(s,a)\triangleq(1-\gamma)\pi(a|s)\sum_{t=0}^{\infty}\gamma^t \mathbb{P}_{p,\pi,t}(s)$ denote the normalized discounted occupancy measure of policy $\pi$ [...] where $L_C$ is the Lipschitz constant of the state cost function $V_c^{p^\star,\pi}(s)$.
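The normalized discounted occupancy measure defined in the lemma can be evaluated numerically for a small tabular MDP. The sketch below is illustrative only: it truncates the infinite sum at a finite `horizon` (an assumption that is accurate when $\gamma^{\text{horizon}}$ is negligible), and the two-state MDP, the function name `occupancy_measure`, and all parameter names are hypothetical.

```python
from typing import List


def occupancy_measure(P: List[List[List[float]]],
                      pi: List[List[float]],
                      rho0: List[float],
                      gamma: float,
                      horizon: int = 2000) -> List[List[float]]:
    """Approximate d(s, a) = (1 - gamma) * pi(a|s) * sum_t gamma^t * P_t(s).

    P[s][a][s2] is the transition probability, pi[s][a] the policy,
    rho0[s] the initial state distribution. The sum over t is truncated
    at `horizon` steps.
    """
    n_s, n_a = len(rho0), len(pi[0])
    state_dist = list(rho0)   # P_t(s), starting at t = 0
    visits = [0.0] * n_s      # running sum of gamma^t * P_t(s)
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_s):
            visits[s] += discount * state_dist[s]
        # Propagate the state distribution one step under pi and P.
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                w = state_dist[s] * pi[s][a]
                for s2 in range(n_s):
                    nxt[s2] += w * P[s][a][s2]
        state_dist = nxt
        discount *= gamma
    return [[(1.0 - gamma) * pi[s][a] * visits[s] for a in range(n_a)]
            for s in range(n_s)]


# Hypothetical MDP: state 0 always moves to state 1, which is absorbing.
P = [[[0.0, 1.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
pi = [[0.5, 0.5], [0.5, 0.5]]   # uniform policy over two actions
d = occupancy_measure(P, pi, rho0=[1.0, 0.0], gamma=0.9)
print(d)
```

For this MDP the state-visitation mass is $1-\gamma$ on state 0 and $\gamma$ on state 1, and the entries of `d` sum to one, which is the sense in which the measure is normalized.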

Figures (23)

  • Figure 1: Uncertainty over a quadruped robot’s trajectory. The snapshots illustrate the robot's pose at key moments, with corresponding uncertainty levels highlighted. High-uncertainty transitions are incorporated into the cost function to discourage the policy from entering regions where the simulator is inaccurate and behavior is more likely to become unsafe during real-world deployment.
  • Figure 2: Example trajectories of SPiDR with RaceCar and Unitree Go1.
  • Figure 3: Performance on the race car and Unitree Go1. SPiDR (sim-to-sim) and SPiDR (sim-to-real) represent evaluation in simulation and on the real system respectively. SPiDR transfers safely, while domain randomization dramatically violates the safety constraints.
  • Figure 4: Performance is maintained on the Unitree Go1.
  • Figure 5: Safe transfer to a real Unitree Go1 with PPO.
  • ...and 18 more figures

Theorems & Definitions (13)

  • Lemma 4.1
  • Theorem 4.2
  • Example A.1
  • proof
  • Lemma B.1: Telescoping lemma
  • proof
  • proof
  • Definition C.1: $L_1$-Wasserstein distance (clark)
  • Lemma C.5: Bernstein transportation (talebi2018variance)
  • Lemma C.6: Variance bound for change of measure (menard_fast_2021)
  • ...and 3 more