What Matters for Simulation to Online Reinforcement Learning on Real Robots

Yarden As; Dhruva Tirumala; René Zurbrügg; Chenhao Li; Stelian Coros; Andreas Krause; Markus Wulfmeier

TL;DR

Some widely used defaults for online RL on real robots can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware.

Abstract

We investigate what specific design choices enable successful online reinforcement learning (RL) on physical robots. Across 100 real-world training runs on three distinct robotic platforms, we systematically ablate algorithmic, systems, and experimental decisions that are typically left implicit in prior work. We find that some widely used defaults can be harmful, while a set of robust, readily adopted design choices within standard RL practice yield stable learning across tasks and hardware. These results provide the first large-sample empirical study of such design choices, enabling practitioners to deploy online RL with lower engineering effort.

Paper Structure (45 sections, 8 equations, 18 figures)

Introduction
Our contribution.
Related Work
Background
Problem Setting
Markov decision processes.
Episodic online learning.
Priors.
Online Transfer
Sample efficiency.
Off-policy learning.
Approximate policy improvement.
Distribution shifts and the "downward spiral".
Stabilizing Learning Under Deployment Shifts
Data retention.
...and 30 more sections

Figures (18)

Figure 1: Robotic platforms studied in this work. We conduct our experiments on three robotic platforms spanning manipulation, locomotion and navigation robotic tasks. Manipulation. We use a Franka Emika Panda robot to locate, grasp and lift a cube to a goal position. The policy determines the end-effector's position and the gripper's opening given grayscale image observations. Locomotion. We use a Unitree Go1 robot to follow joystick commands. The policy maps randomly sampled linear and angular velocity commands (expressed in the robot's local coordinate frame) to joint position targets. Navigation. Finally, we use a remote-controlled race car that must park at a specified goal position as quickly as possible. This task is particularly challenging due to the system's high agility, fast control loop (60 Hz) and the difficulty of accurately modeling tire friction and drifting behavior.
Figure 2: Off-policy algorithms may lose stability due to approximation errors in action-value functions, leading to unlearning of the prior policy $\pi_0$ during online learning.
Figure 3: Downward spiral on a simulated Race Car robot under a mild dynamics mismatch. Left: Performance during learning using vanilla Soft Actor-Critic ("Unstable") and our approach ("Stable"). Right: Time-series histograms of the empirical estimates of the errors $\epsilon(s, a)$ over the course of online learning. Concretely, denote $Q^{\pi_n}_{\text{MC}}(s_t, a_t)$ as the Monte Carlo estimate of the true action value of $\pi_n$ and $Q^{\pi_n}_{\phi}(s_t, a_t)$ as its learned approximation. For each episode $n = 0, \dots, N - 1$, we compute $\epsilon(s_t, a_t) \approx Q^{\pi_n}_{\phi}(s_t, a_t) - Q^{\pi_n}_{\text{MC}}(s_t, a_t) \;\forall s_t, a_t \in \mathcal{D}_n$. We compute the histogram of these values after every episode and represent their log counts as intensity in the two right plots. As shown, in the unstable run the learned action-value function $Q^{\pi_n}_\phi$ overestimates $Q^{\pi_n}_{\text{MC}}(s_t, a_t)$ in a large portion of the states that are inserted into $\mathcal{D}_n$. In contrast, for the stable run, most of the mass concentrates just slightly above zero, indicating low errors and therefore improved learning stability.
Figure 4: Learning curves of Soft Actor-Critic under mismatch in the dynamics. For the Franka Emika Panda robot, we replace the cube with a soft red ball. For the race car, we first pretrain on a semi-kinematic bicycle model and finetune on more realistic dynamics that account for tire friction (kabzan2020amz). For the Unitree G1 (zakka2025mujoco) and Go1 robots, we reduce the ground friction. We ablate $M \in \{20, 10, 5, 1\}$, showing significant stability improvements across all robots as we increase $M$ and reduce the learning rate from $3\times10^{-4}$ to $1\times10^{-5}$. The additional experiments section shows similar results using TD3 (fujimoto2018addressing).
Figure 5: Comparison of learning performance and runtime for different configurations. Top: Unitree Go1 experiments. Bottom: Franka Emika Panda experiments. Increasing the update-to-data (UTD) ratio requires fewer environment steps, but at the price of longer training time.
...and 13 more figures
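
The error estimate described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the discount factor `gamma`, the bin width, and the toy episode are all assumptions made here. It also ignores the entropy term in Soft Actor-Critic's soft action value and any bias from truncated episodes, and it uses `log(1 + count)` as one possible form of the log-count intensity.

```python
import math
from collections import Counter


def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return-to-go for every step of one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step sees the discounted future return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_errors(q_pred, rewards, gamma=0.99):
    """epsilon(s_t, a_t) ~ Q_phi(s_t, a_t) - Q_MC(s_t, a_t) for one episode."""
    q_mc = monte_carlo_returns(rewards, gamma)
    return [q - g for q, g in zip(q_pred, q_mc)]


def log_count_histogram(errors, bin_width=1.0):
    """Map each occupied bin's left edge to log(1 + count), usable as intensity."""
    counts = Counter(math.floor(e / bin_width) for e in errors)
    return {b * bin_width: math.log1p(c) for b, c in sorted(counts.items())}


# Hypothetical episode: three rewards and the critic's predictions at the same steps.
rewards = [1.0, 1.0, 1.0]
q_pred = [4.0, 2.5, 1.0]
errors = value_errors(q_pred, rewards, gamma=0.5)
histogram = log_count_histogram(errors, bin_width=1.0)
```

Positive entries in `errors` correspond to overestimation by the learned critic. Computing one such histogram per episode and stacking them over time gives a picture of the kind shown in the right-hand panels.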
