
What Matters for Simulation to Online Reinforcement Learning on Real Robots

Yarden As, Dhruva Tirumala, René Zurbrügg, Chenhao Li, Stelian Coros, Andreas Krause, Markus Wulfmeier

TL;DR

We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware.

Abstract

We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware. These results provide the first large-sample empirical study of such design choices, enabling practitioners to deploy online RL with lower engineering effort.

Paper Structure (45 sections, 8 equations, 18 figures)


Figures (18)

  • Figure 1: Robotic platforms studied in this work. We conduct our experiments on three robotic platforms spanning manipulation, locomotion and navigation robotic tasks. Manipulation. We use a Franka Emika Panda robot to locate, grasp and lift a cube to a goal position. The policy determines the end-effector's position and the gripper's opening given grayscale image observations. Locomotion. We use a Unitree Go1 robot to follow joystick commands. The policy maps randomly sampled linear and angular velocity commands (expressed in the robot's local coordinate frame) to joint position targets. Navigation. Finally, we use a remote-controlled race car that must park at a specified goal position as quickly as possible. This task is particularly challenging due to the system's high agility, fast control loop (60 Hz) and the difficulty of accurately modeling tire friction and drifting behavior.
  • Figure 2: Off-policy algorithms may lose stability due to approximation errors in action-value functions, leading to unlearning of the prior policy $\pi_0$ during online learning.
  • Figure 3: Downward spiral on a simulated Race Car robot under a mild dynamics mismatch. Left: Performance during learning using vanilla Soft Actor-Critic ("Unstable") and our approach ("Stable"). Right: Time-series histograms of the empirical estimates of the errors $\epsilon(s, a)$ over the course of online learning. Concretely, denote $Q^{\pi_n}_{\text{MC}}(s_t, a_t)$ as the Monte Carlo estimate of the real action value of $\pi_n$ and $Q^{\pi_n}_{\phi}(s_t, a_t)$ as its learned approximation. For each episode $n = 0, \dots, N - 1$, we compute $\epsilon(s_t, a_t) \approx Q^{\pi_n}_{\phi}(s_t, a_t) - Q^{\pi_n}_{\text{MC}}(s_t, a_t) \;\forall (s_t, a_t) \in \mathcal{D}_n$. We compute the histogram of these values after every episode and represent their log counts as intensity in the two right plots. As shown, in the unstable run the learned action-value function $Q^{\pi_n}_\phi$ overestimates $Q^{\pi_n}_{\text{MC}}(s_t, a_t)$ in a large portion of the states that are inserted into $\mathcal{D}_n$. In contrast, for the stable run, most of the mass concentrates just slightly above zero, indicating low errors and therefore improved learning stability.
  • Figure 4: Learning curves of Soft Actor-Critic under a mismatch in the dynamics. On the Franka Emika Panda robot, we replace the cube with a soft red ball. On the Race Car, we first pretrain on a semi-kinematic bicycle model and finetune on more realistic dynamics that account for tire friction [kabzan2020amz]. On the Unitree G1 [zakka2025mujoco] and Go1 robots, we reduce the ground friction. We ablate $M \in \{20, 10, 5, 1\}$, showing significant stability improvements as we increase $M$ and reduce the learning rate from $3\times10^{-4}$ to $1\times10^{-5}$ across all robots. In the additional experiments section (sec:additional-experiments), we show similar results using TD3 [fujimoto2018addressing].
  • Figure 5: Comparison of learning performance and runtime for different configurations. Top: Unitree Go1 experiments. Bottom: Franka Emika Panda experiments. Increasing the update-to-data (UTD) ratio requires fewer environment steps, at the price of longer training time.
  • ...and 13 more figures
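
The error estimate described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the discount factor `gamma`, and the bin edges are hypothetical, and the sketch assumes each episode is complete and ignores any entropy bonus in the Monte Carlo return.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = sum_k gamma^k * r_{t+k} for one finished episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Accumulate backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_errors(q_pred, rewards, gamma=0.99):
    """epsilon(s_t, a_t) ~= Q_phi(s_t, a_t) - Q_MC(s_t, a_t) along one episode.

    q_pred holds the critic's predictions for the visited (s_t, a_t) pairs
    (assumed to be supplied by the caller).
    """
    q_pred = np.asarray(q_pred, dtype=np.float64)
    return q_pred - discounted_returns(rewards, gamma)


def log_count_histogram(errors, bin_edges):
    """Histogram of errors with log(1 + count) intensities, one row per episode."""
    counts, _ = np.histogram(errors, bins=bin_edges)
    return np.log1p(counts)
```

Stacking one `log_count_histogram` row per episode gives a time-series image like the right-hand panels: mass well above zero indicates overestimation by the critic, while mass near zero indicates low error.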