Personalization in Human-Robot Interaction through Preference-based Action Representation Learning
Ruiqi Wang, Dezhong Zhao, Dayoon Suh, Ziqin Yuan, Guohua Chen, Byung-Cheol Min
TL;DR
PbARL addresses the feedback inefficiency of PbRL in HRI by decoupling the shared task structure from user-specific preferences and fine-tuning through a mutual-information-based latent action space. It keeps a pre-trained policy $\pi^{p}$ fixed during adaptation and learns a latent action encoder $f_\theta$, implemented as a conditional variational autoencoder, that maximizes the mutual information $\mathcal{I}_m$ between the source domain $\mathcal{M}_s$ and the target domain $\mathcal{M}_t$ in the action space, guided by a preference reward model $\hat{R}$. Training optimizes a composite loss $\mathcal{L}_{PbARL}$ that combines reconstruction, contrastive preference, KL regularization, and latent dynamics terms. Empirical results on Assistive Gym and a real-world user study (N=8) show that PbARL achieves higher personalization with only modest degradation in task performance, outperforms PbRL-HRI and PrefFT in most settings, and is practically feasible for real HRI deployments.
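To make the structure of the composite loss concrete, the following is a minimal sketch of how four such terms might be computed and combined. It is an illustration under stated assumptions, not the authors' implementation: the function names, the unit weights, the use of mean squared error for the reconstruction and latent dynamics terms, and the Bradley-Terry style stand-in for the contrastive preference term are all hypothetical. Only the KL term follows the standard closed form for a diagonal Gaussian posterior against a standard normal prior, as commonly used in variational autoencoders.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), the usual VAE regularizer."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def preference_nll(score_preferred, score_rejected):
    """Bradley-Terry style negative log-likelihood.

    Assumed stand-in for the contrastive preference term; the summary
    does not specify its exact form.
    """
    return math.log1p(math.exp(-(score_preferred - score_rejected)))


def composite_loss(recon, pref, kl, dyn, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights are hypothetical."""
    w_recon, w_pref, w_kl, w_dyn = weights
    return w_recon * recon + w_pref * pref + w_kl * kl + w_dyn * dyn


# Toy values, chosen only to exercise the functions.
recon = mse([0.1, 0.2], [0.0, 0.25])            # decoded action vs. reference action
pref = preference_nll(1.3, 0.4)                 # reward-model scores for a preferred/rejected pair
kl = gaussian_kl([0.2, -0.1], [0.0, -0.5])      # encoder posterior vs. standard normal prior
dyn = mse([0.5, 0.5], [0.4, 0.6])               # predicted vs. observed next latent state
total = composite_loss(recon, pref, kl, dyn)
```

In this sketch the pre-trained policy would supply the reference actions for the reconstruction term and stay frozen, so only the encoder-side quantities would receive gradients in a real training loop.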
Abstract
Preference-based reinforcement learning (PbRL) has shown significant promise for personalization in human-robot interaction (HRI) by explicitly integrating human preferences into the robot learning process. However, existing practices often require training a personalized robot policy from scratch, resulting in inefficient use of human feedback. In this paper, we propose preference-based action representation learning (PbARL), an efficient fine-tuning method that decouples common task structure from preference by leveraging pre-trained robot policies. Instead of directly fine-tuning the pre-trained policy with human preference, PbARL uses it as a reference for an action representation learning task that maximizes the mutual information between the pre-trained source domain and the target user preference-aligned domain. This approach allows the robot to personalize its behaviors while preserving original task performance and eliminates the need for extensive prior information from the source domain, thereby enhancing efficiency and practicality in real-world HRI scenarios. Empirical results on the Assistive Gym benchmark and a real-world user study (N=8) demonstrate the benefits of our method compared to state-of-the-art approaches.
