SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion

Elham Daneshmand, Shafeef Omar, Glen Berseth, Majid Khadiv, Hsiu-Chin Lin

Abstract

Sim-to-real transfer of locomotion policies often degrades performance because of the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic: it risks mechanical failure and is highly sample-inefficient. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we focus on fine-tuning policies learned in simulation directly on hardware while explicitly enforcing safety constraints. To this end, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a $46.5\%$ reduction in fine-tuning time and near-zero safety violations compared to standard proximal policy optimization (PPO) baselines. Notably, we find that a rank-1 adaptation alone is sufficient to recover pre-trained performance in the real world while maintaining stable and safe fine-tuning. These results demonstrate the practicality of safe, efficient fine-tuning for dynamic real-world robotic applications.

Paper Structure (21 sections, 2 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The SLowRL architecture. A frozen main policy and a trainable LoRA adapter operate in parallel to generate the main action. A Safety Filter monitors the robot's state and selects between the main action and a conservative Recovery Policy action to ensure safe operation during fine-tuning.
  • Figure 2: Detailed schematic of the SLowRL framework. We freeze the pre-trained policy parameters (blue dashed blocks) to retain prior knowledge from IsaacLab. To enable adaptation to the target environment, we inject low-rank trainable adapters (green blocks) in parallel to the frozen weights. The outputs are summed before the ELU activation, ensuring a safe exploration process.
  • Figure 3: The Unitree Go2 robot performs a dynamic jump. Using SLowRL, we fine-tune a low-rank adapter on top of a frozen simulation-trained policy, achieving substantially faster training with near-zero safety violations compared to standard PPO.
  • Figure 4: Sim-to-Sim Sample Efficiency during Fine-tuning (Trot Task). Comparative learning curves showing mean reward over wall-clock time for the trotting task. SLowRL (blue) significantly outperforms the FFT and Zero-Shot baselines, achieving a $38\%$ reduction in time-to-convergence while maintaining higher stability.
  • Figure 5: Sim-to-Sim Sample Efficiency during Fine-tuning (Jump Task). Performance comparison for the dynamic jumping task on the Unitree Go2 robot. SLowRL demonstrates superior sample efficiency, reducing convergence time by $55\%$ compared to baselines without incurring the safety violations observed in FFT.
  • ...and 4 more figures
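
The captions of Figures 1 and 2 describe two mechanisms: a frozen layer with a parallel low-rank adapter whose outputs are summed before the ELU activation, and a safety filter that switches between the main action and a recovery action. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The layer sizes, the zero initialization of B, the `scale` factor, the dictionary-shaped state, and the tilt-based check `tilt_exceeds_limit` with its 0.6 rad threshold are all assumptions made for illustration; the paper's actual safety criterion and recovery policy are not specified in this excerpt.

```python
import numpy as np


def elu(x, alpha=1.0):
    # ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


class LoRALinear:
    """One policy layer: frozen weights plus a parallel low-rank adapter."""

    def __init__(self, W, b, rank=1, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = W.shape
        self.W = W  # frozen, pre-trained in simulation
        self.b = b  # frozen
        # Trainable factors. B starts at zero (an assumed, common LoRA
        # initialization) so the adapted layer initially matches the frozen one.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = scale  # hypothetical scaling factor

    def num_trainable(self):
        # Only A and B are updated during fine-tuning: rank * (in_dim + out_dim).
        return self.A.size + self.B.size

    def __call__(self, x):
        frozen = self.W @ x + self.b
        adapter = self.scale * (self.B @ (self.A @ x))
        # Frozen and adapter outputs are summed before the ELU activation.
        return elu(frozen + adapter)


def select_action(state, main_action, recovery_action, is_unsafe):
    # Safety filter: fall back to the recovery action when the state is unsafe.
    return recovery_action if is_unsafe(state) else main_action


def tilt_exceeds_limit(state, limit_rad=0.6):
    # Hypothetical safety check on body roll and pitch (radians).
    return abs(state["roll"]) > limit_rad or abs(state["pitch"]) > limit_rad


rng = np.random.default_rng(1)
W, b = rng.normal(size=(8, 4)), np.zeros(8)
layer = LoRALinear(W, b, rank=1)
x = rng.normal(size=4)

main_action = layer(x)[:2]     # stand-in for the main policy's action
recovery_action = np.zeros(2)  # stand-in for the recovery policy's action
action = select_action({"roll": 0.9, "pitch": 0.0},
                       main_action, recovery_action, tilt_exceeds_limit)
```

In this sketch a rank-1 adapter on an 8-by-4 layer adds 12 trainable parameters next to the 32 frozen weights, which illustrates why low-rank adaptation keeps the number of parameters updated on hardware small.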