Personalization in Human-Robot Interaction through Preference-based Action Representation Learning

Ruiqi Wang, Dezhong Zhao, Dayoon Suh, Ziqin Yuan, Guohua Chen, Byung-Cheol Min

TL;DR

PbARL addresses the inefficiency of PbRL in HRI by decoupling shared task structure from user-specific preferences and fine-tuning through a mutual-information-based latent action space. It keeps a pre-trained policy $\pi^{p}$ fixed during adaptation and learns a latent action encoder $f_\theta$, implemented as a conditional variational autoencoder, that maximizes the mutual information $\mathcal{I}_m$ between the source domain $\mathcal{M}_s$ and the target domain $\mathcal{M}_t$ in the action space, guided by a preference reward model $\hat{R}$. Training optimizes a composite loss $\mathcal{L}_{PbARL}$ that combines reconstruction, contrastive preference, KL regularization, and latent dynamics terms. Empirical results on Assistive Gym and a real-world user study (N=8) show that PbARL delivers higher personalization with modest task-performance degradation, outperforms PbRL-HRI and PrefFT in most settings, and is practically feasible for real HRI deployments.
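
Written as an equation, a composite objective of this kind is commonly a weighted sum of its terms. The expression below is only a sketch under that assumption: the term names follow the Figure 2 caption, while the weights $\lambda_1$, $\lambda_2$, $\lambda_3$ are hypothetical placeholders that this summary does not specify, and the paper's exact weighting may differ.

$$
\mathcal{L}_{PbARL} = \mathcal{L}^{Rec}_\theta + \lambda_1\,\mathcal{L}^{Pref}_\theta + \lambda_2\,\mathcal{L}^{KL}_\theta + \lambda_3\,\mathcal{L}^{Dyn}_\varsigma
$$

Under this reading, the first three terms shape the encoder $f_\theta$, and the last term trains the auxiliary latent transition model.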

Abstract

Preference-based reinforcement learning (PbRL) has shown significant promise for personalization in human-robot interaction (HRI) by explicitly integrating human preferences into the robot learning process. However, existing practices often require training a personalized robot policy from scratch, resulting in inefficient use of human feedback. In this paper, we propose preference-based action representation learning (PbARL), an efficient fine-tuning method that decouples common task structure from preference by leveraging pre-trained robot policies. Instead of directly fine-tuning the pre-trained policy with human preference, PbARL uses it as a reference for an action representation learning task that maximizes the mutual information between the pre-trained source domain and the target user preference-aligned domain. This approach allows the robot to personalize its behaviors while preserving original task performance and eliminates the need for extensive prior information from the source domain, thereby enhancing efficiency and practicality in real-world HRI scenarios. Empirical results on the Assistive Gym benchmark and a real-world user study (N=8) demonstrate the benefits of our method compared to state-of-the-art approaches.

Paper Structure (17 sections, 9 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Comparison of our method with previous preference-based approaches for personalized adaptations. Unlike the common PbRL regime, which trains personalized policies from scratch, our method shifts toward fine-tuning to leverage human feedback more efficiently. Instead of using the preference-aligned reward model to directly adjust the pre-trained policy via RL, we employ it for an action representation task to train a mutual information encoder, preserving the pre-trained task performance while enhancing personalization.
  • Figure 2: Overview of PbARL. We train PbARL using transition tuples: current state $s$, action distribution $\mathbf{a}$, and next state $s'$, collected by testing a pre-trained robot policy $\pi^p$ in the environment. The objective is to learn a harmonized latent action space within the mutual information state encoder $f_\theta$, implemented as a conditional VAE, by collectively optimizing three losses: a reconstruction loss $\mathcal{L}^{Rec}_\theta$, a preference loss $\mathcal{L}^{Pref}_\theta$ that reflects the consistency between the original action ranking list $\overline{\mathbf{a}}^p$ and the re-ranked action list $\overline{\mathbf{a}}^{\hat{r}}$ based on scores derived from the preference-aligned reward model $\hat{R}$, and a Kullback–Leibler (KL) loss $\mathcal{L}^{KL}_\theta$ to regularize the latent space in the VAE structure. To enhance controllability and scalability in the learned latent action space, we also conduct an auxiliary task to train a latent transition model $\mathcal{T}_\varsigma$, optimized via a dynamic loss $\mathcal{L}^{Dyn}_\varsigma$.
  • Figure 3: Depiction of the user study setup, where a seated participant interacts with a Jaco assistive robot arm for feeding using a spoon.
  • Figure 4: Mean rating scores of satisfaction and personalization levels for each method, as rated by participants. Results of two-sample t-tests are reported, with * indicating $p<.01$.
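
Two ingredients named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. The KL function is the standard closed form for a diagonal-Gaussian VAE posterior against a standard normal prior. The ranking functions are only an illustration of what "consistency" between the policy's action ranking and the reward-model re-ranking could mean: the pairwise disagreement count is an assumption made here for clarity, it is not differentiable, and it is not the paper's $\mathcal{L}^{Pref}_\theta$. All function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def rank_descending(scores):
    """Return action indices ordered from highest to lowest score."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")


def pairwise_rank_disagreement(policy_probs, reward_scores):
    """Fraction of action pairs that the two score lists order differently.

    0.0 means the rankings agree on every pair, 1.0 means they are reversed.
    Illustrative stand-in only (an assumption, not the paper's preference loss).
    """
    p = np.asarray(policy_probs, dtype=float)
    r = np.asarray(reward_scores, dtype=float)
    pairs = 0
    disagreements = 0
    for i in range(len(p)):
        for j in range(i + 1, len(p)):
            pairs += 1
            # Opposite signs mean the two lists order actions i and j differently.
            if (p[i] - p[j]) * (r[i] - r[j]) < 0:
                disagreements += 1
    return disagreements / pairs if pairs else 0.0


if __name__ == "__main__":
    policy_probs = [0.5, 0.3, 0.2]   # hypothetical policy action probabilities
    reward_scores = [0.1, 0.9, 0.4]  # hypothetical reward-model scores
    print(rank_descending(policy_probs), rank_descending(reward_scores))
    print(pairwise_rank_disagreement(policy_probs, reward_scores))
    print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

A training objective would need a differentiable surrogate in place of the disagreement count; the count is used here only because it makes the notion of ranking consistency easy to inspect.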