FLoRA: Sample-Efficient Preference-based RL via Low-Rank Style Adaptation of Reward Functions
Daniel Marta, Simon Holk, Miguel Vasco, Jens Lundell, Timon Homberger, Finn Busch, Olov Andersson, Danica Kragic, Iolanda Leite
TL;DR
FLoRA applies low-rank adaptation to the reward function in preference-based RL, enabling sample-efficient style adaptation of robotic behavior while mitigating catastrophic reward forgetting. By parameterizing the reward-function update as $\Delta\psi = \psi_B \psi_A$ with rank $r$, FLoRA preserves the original reward model $\psi_0$ and trains only a small adapter, which adds little to no runtime overhead and improves generalization in low-data regimes. Across diverse simulation benchmarks and real-world robots, FLoRA adapts style effectively while retaining robust performance on the original task, outperforming fine-tuning and data-augmentation baselines and demonstrating practical viability in robotics. The work highlights a modular, agnostic approach to reward adaptation that scales to complex tasks and lays the groundwork for future multi-style adaptation and foundation-model-like control policies in robotics.
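The decomposition $\Delta\psi = \psi_B \psi_A$ can be illustrated on a single linear layer. The following is a minimal sketch, not the authors' implementation: the class name `LowRankAdaptedLayer`, the helper `matvec`, the toy weights, and the zero initialization of the `B` factor (a common low-rank-adapter convention) are all assumptions made for illustration. It shows that the original weights stay untouched, that the adapted layer initially reproduces the original output, and that the adapter has far fewer parameters than the full weight matrix.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LowRankAdaptedLayer:
    """Hypothetical linear layer: preserved weights W0 plus a rank-r update B @ A."""

    def __init__(self, W0, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0  # original weights (psi_0): never modified here
        # Assumed initialization: A small and random, B zero, so B @ A = 0 at the start.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W0, x)                   # original layer output
        delta = matvec(self.B, matvec(self.A, x))   # low-rank correction B (A x)
        return [b + d for b, d in zip(base, delta)]

    def num_adapter_params(self):
        # Only A (r x d_in) and B (d_out x r) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 3x4 weight matrix and a rank-1 adapter (illustrative values only).
W0 = [[0.5, -1.0, 2.0, 0.0],
      [1.0, 0.0, 0.0, 3.0],
      [0.0, 1.0, 1.0, 1.0]]
layer = LowRankAdaptedLayer(W0, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
original_output = matvec(W0, x)
adapted_output = layer.forward(x)
```

In this sketch, the adapter holds $r(d_{in} + d_{out})$ parameters instead of $d_{in} d_{out}$, and because the two matrix-vector products can be merged into the original weights after training, the low-rank form is consistent with the claim of little to no runtime overhead.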
Abstract
Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot's policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work, we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leaving the agent unable to perform the original task. To mitigate CRF, we propose to augment the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks.
