SLowRL: Safe Low-Rank Adaptation Reinforcement Learning for Locomotion

Elham Daneshmand; Shafeef Omar; Glen Berseth; Majid Khadiv; Hsiu-Chin Lin

Abstract

Sim-to-real transfer of locomotion policies often degrades performance because of the inevitable sim-to-real gap. Naively fine-tuning these policies directly on hardware is problematic: it risks mechanical failure and is sample-inefficient. In this paper, we address the challenge of safely and efficiently fine-tuning reinforcement learning (RL) policies for dynamic locomotion tasks. Specifically, we fine-tune policies learned in simulation directly on hardware while explicitly enforcing safety constraints. To this end, we introduce SLowRL, a framework that combines Low-Rank Adaptation (LoRA) with training-time safety enforcement via a recovery policy. We evaluate our method both in simulation and on a real Unitree Go2 quadruped robot for jump and trot tasks. Experimental results show that our method achieves a $46.5\%$ reduction in fine-tuning time and near-zero safety violations compared to standard proximal policy optimization (PPO) baselines. Notably, we find that a rank-1 adaptation alone is sufficient to recover pre-trained performance in the real world while keeping real-world fine-tuning stable and safe. These results demonstrate the practicality of safe, efficient fine-tuning for dynamic real-world robotic applications.

Paper Structure (21 sections, 2 equations, 9 figures, 2 tables)

Introduction
Related Work
Background
Reinforcement Learning Formulation
Low-Rank Adaptation (LoRA)
$SLowRL$: Efficient Real-World Adaptation
System Overview
LoRA-PPO Configuration
Safety Mechanisms
Experimental Results
Experimental Setup
Safety During Fine-tuning
Sample Efficiency during Fine-tuning
The Effect of Rank ($\rho$)
Ablation Analysis
...and 6 more sections

Figures (9)

Figure 1: The $SLowRL$ architecture. A frozen main policy and a trainable LoRA adapter operate in parallel to generate the main action. A Safety Filter monitors the robot's state and selects between the main action and a conservative Recovery Policy action to ensure safe operation during fine-tuning.
Figure 2: Detailed schematic of the $SLowRL$ framework. We freeze the pre-trained policy parameters (blue dashed blocks) to retain prior knowledge from IsaacLab. To enable adaptation to the target environment, we inject low-rank trainable adapters (green blocks) in parallel to the frozen weights. The outputs are summed before the ELU activation, ensuring a safe exploration process.
Figure 3: The Unitree Go2 robot performs a dynamic jump. Using $SLowRL$, we fine-tune a low-rank adapter on top of a frozen simulation-trained policy, achieving substantially faster training with near-zero safety violations compared to standard PPO.
Figure 4: Sim-to-Sim Sample Efficiency during Fine-tuning (Trot Task). Comparative learning curves showing mean reward over wall-clock time for the trotting task. $SLowRL$ (blue) significantly outperforms the FFT and Zero-Shot baselines, achieving a $38\%$ reduction in time-to-convergence while maintaining higher stability.
Figure 5: Sim-to-Sim Sample Efficiency during Fine-tuning (Jump Task). Performance comparison for the dynamic jumping task on the Unitree Go2 robot. $SLowRL$ demonstrates superior sample efficiency, reducing convergence time by $55\%$ compared to baselines, without incurring the safety violations observed in FFT.
...and 4 more figures
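
The captions of Figures 1 and 2 describe two mechanisms: a frozen layer with a parallel low-rank adapter whose outputs are summed before the ELU activation, and a safety filter that chooses between the main action and a recovery action. The following is a minimal, dependency-free sketch of both ideas, not the authors' implementation. The class name `LoRALinear`, the function `safety_filter`, the zero initialization of `B`, the `scale` factor, and the tilt-based safety rule with its `max_tilt` threshold are all assumptions made for illustration; the paper's actual layer sizes, initialization, and safety conditions are not given in this summary.

```python
import math
import random


def elu(x, alpha=1.0):
    """ELU activation for a single scalar."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen layer W x + b with a parallel low-rank branch B (A x).

    Only A (rank x in_dim) and B (out_dim x rank) would be trained.
    """

    def __init__(self, W, b, rank=1, scale=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W, self.b = W, b  # frozen pre-trained parameters
        # Assumption: small random A and zero B, so the adapter
        # contributes nothing before fine-tuning starts.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = scale

    def __call__(self, x):
        frozen = matvec(self.W, x)
        low_rank = matvec(self.B, matvec(self.A, x))
        # Sum the two branches first, then apply ELU (as in Figure 2).
        return [elu(f + self.scale * l + bi)
                for f, l, bi in zip(frozen, low_rank, self.b)]

    def num_trainable(self):
        """Number of adapter parameters (entries of A and B)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


def safety_filter(state, main_action, recovery_action, max_tilt=0.6):
    """Hypothetical rule: use the recovery action when |roll| or |pitch|
    (in radians) exceeds max_tilt; otherwise pass the main action through."""
    unsafe = abs(state["roll"]) > max_tilt or abs(state["pitch"]) > max_tilt
    return recovery_action if unsafe else main_action


# Usage: a 2x2 frozen layer with a rank-1 adapter.
layer = LoRALinear(W=[[1.0, 0.0], [0.0, -1.0]], b=[0.0, 0.0], rank=1)
hidden = layer([0.5, 2.0])
action = safety_filter({"roll": 0.9, "pitch": 0.0},
                       main_action=hidden, recovery_action=[0.0, 0.0])
```

With a rank-1 adapter, the trainable parameter count of a layer is `in_dim + out_dim` rather than `in_dim * out_dim`, which is the source of the parameter savings that low-rank adaptation offers over full fine-tuning.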
