FLoRA: Sample-Efficient Preference-based RL via Low-Rank Style Adaptation of Reward Functions

Daniel Marta, Simon Holk, Miguel Vasco, Jens Lundell, Timon Homberger, Finn Busch, Olov Andersson, Danica Kragic, Iolanda Leite

TL;DR

FLoRA applies low-rank adaptation to the reward function in preference-based RL, enabling sample-efficient style adaptation of robotic behavior while mitigating catastrophic reward forgetting. By decomposing the reward-function update as $\Delta\psi=\psi_B\psi_A$ with rank $r$, FLoRA keeps the original reward model ($\psi_0$) frozen and trains only a small adapter, resulting in little to no runtime overhead and improved generalization in low-data regimes. Across diverse simulation benchmarks and real-world robots, FLoRA achieves effective style adaptation while retaining robust performance on the original task, outperforming fine-tuning and data-augmentation baselines and demonstrating practical viability in robotics. The work presents a modular approach to reward adaptation that is agnostic to the underlying preference-based RL algorithm, scales to complex tasks, and lays groundwork for future multi-style adaptation and foundation-model-like control policies in robotics.
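The decomposition $\Delta\psi=\psi_B\psi_A$ can be illustrated with a minimal sketch of a single adapted linear layer. This is not the authors' implementation: the class name `LowRankAdaptedLayer`, the layer sizes, the Gaussian initialization, and the zero initialization of $\psi_B$ (borrowed from common LoRA practice) are all assumptions made for illustration. The sketch shows that the adapted layer reproduces the original output before any training, that the adapter has fewer parameters than the full weight matrix, and that the update can be merged into the frozen weights so the forward pass costs the same as the original layer.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix with small Gaussian entries (initialization scheme is an assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LowRankAdaptedLayer:
    """Hypothetical linear layer: frozen psi_0 plus a trainable update psi_B @ psi_A."""

    def __init__(self, d_in, d_out, rank, rng):
        self.psi_0 = rand_matrix(d_out, d_in, rng)           # frozen, d_out x d_in
        self.psi_A = rand_matrix(rank, d_in, rng)            # trainable, r x d_in
        self.psi_B = [[0.0] * rank for _ in range(d_out)]    # trainable, d_out x r, zero init (assumption)

    def delta(self):
        """Low-rank update Delta psi = psi_B psi_A, of rank at most r."""
        return matmul(self.psi_B, self.psi_A)

    def forward(self, x):
        """Sum of the frozen path and the adapter path for an input vector x."""
        col = [[v] for v in x]
        base = matmul(self.psi_0, col)
        update = matmul(self.delta(), col)
        return [b[0] + u[0] for b, u in zip(base, update)]

    def merged_weights(self):
        """psi_0 + Delta psi, usable as a single weight matrix at run time."""
        return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(self.psi_0, self.delta())]

    def trainable_parameters(self):
        return len(self.psi_A) * len(self.psi_A[0]) + len(self.psi_B) * len(self.psi_B[0])


rng = random.Random(0)
layer = LowRankAdaptedLayer(d_in=8, d_out=4, rank=2, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Before training, psi_B is zero, so the adapted layer matches the original one.
original_output = [row[0] for row in matmul(layer.psi_0, [[v] for v in x])]
adapted_output = layer.forward(x)

# The adapter trains r * (d_in + d_out) values instead of d_in * d_out.
full_parameters = len(layer.psi_0) * len(layer.psi_0[0])
adapter_parameters = layer.trainable_parameters()
```

In this toy configuration the adapter holds 24 trainable values against 32 in the full matrix; the saving grows as the layer dimensions increase relative to $r$.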

Abstract

Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot's policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leading the agent to become unable to perform the original task. To mitigate CRF, we propose to enhance the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks.
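The abstract does not state the objective used to fit the reward model. PbRL methods commonly train it with a Bradley-Terry cross-entropy loss over pairs of trajectory segments, and the sketch below assumes that objective purely for illustration. The function names, the toy reward function, and the scalar states and actions are hypothetical and do not come from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def segment_return(reward_fn, segment):
    """Sum of predicted rewards over a segment of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in segment)


def preference_probability(reward_fn, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    return sigmoid(segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b))


def preference_loss(reward_fn, seg_a, seg_b, label):
    """Cross-entropy loss; label is 1.0 if seg_a is preferred, 0.0 if seg_b is."""
    p = preference_probability(reward_fn, seg_a, seg_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def toy_reward(state, action):
    """Toy reward (assumption): higher when the action is close to the state."""
    return -abs(state - action)


close_segment = [(0.0, 0.1), (1.0, 0.9)]   # actions track the states
far_segment = [(0.0, 1.0), (1.0, -1.0)]    # actions deviate from the states

p_close_preferred = preference_probability(toy_reward, close_segment, far_segment)
loss_if_label_agrees = preference_loss(toy_reward, close_segment, far_segment, 1.0)
loss_if_label_disagrees = preference_loss(toy_reward, close_segment, far_segment, 0.0)
```

With only a handful of labeled pairs, a loss of this form can be driven toward zero by changing many reward parameters at once, which is consistent with the overfitting behavior the abstract attributes to CRF.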

Paper Structure

This paper contains 23 sections, 3 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Style Adaptation of Robotic Behavior. We focus on the style adaptation of pre-trained robot behavior, where the goal of the adaptation is to adjust the behavior of the agent according to human preferences, while still being able to perform the original task: (a) a four-legged mobile robot is pre-trained to follow behind a human user; (b) we collect a small number of novel human preferences (e.g., following the user on their right) to train a reward model and the adapted policy of the robot, while maintaining the ability to follow the human and avoid collisions; (c) we execute the style-adapted policy following the human preferences.
  • Figure 2: The Problem of CRF. t-SNE projections of state-action pairs sampled from the Drawer-Close simulation environment (a) with associated reward values. We plot the normalized reward value (higher is preferred), from green (higher) to red (lower), predicted for each state-action pair by the true reward function (b) and by different adapted reward models (c-e). FLoRA uniquely adapts the reward model to new human preferences while preventing catastrophic forgetting.
  • Figure 3: The FLoRA framework. (a) We pre-train a reward model with parameters $\psi_0$ by interleaving policy and reward training, following any preference-based RL algorithm; (b) given new preferences from a human user, new weights $\psi_A$ and $\psi_B$ are introduced and fine-tuned while collecting novel feedback to adapt the network to a new task; (c) at run time, the forward pass sums the frozen weights and the reward-adapted weights.
  • Figure 4: Training Curves on Meta-World Environments. (Top) Success rate of the different methods in the original task; (Bottom) Returns of the different methods measured with the style adaptation reward function. We report the mean (solid lines) and the standard error (shaded area) of the performance, averaged over 5 randomly selected seeds. Higher is better.
  • Figure 5: Style Adaptation in Real-World Tasks. For the Spot-Follow task, we report the task success rate (being able to complete the original task), the near-collision rate (counted when Spot overrides the policy with collision avoidance), and the average final distance from the goal, averaged over 20 episodes. For the RW-Drawer-Close task, we report how far the robot pushed the drawer in and the velocity of the end-effector, averaged over 6 episodes. The arrows indicate the direction of improvement.
  • ...and 3 more figures