Fine-Tuning Robot Policies While Maintaining User Privacy

Benjamin A. Christie, Sagar Parekh, Dylan P. Losey

TL;DR

PRoP is a model-agnostic framework for personalized and private robot policies. Each user is equipped with a unique key, which is used to mathematically transform the weights of the robot's network.

Abstract

Recent works introduce general-purpose robot policies. These policies provide a strong prior over how robots should behave -- e.g., how a robot arm should manipulate food items. But in order for robots to match an individual person's needs, users typically fine-tune these generalized policies -- e.g., showing the robot arm how to make their own preferred dinners. Importantly, during the process of personalizing robots, end-users leak data about their preferences, habits, and styles (e.g., the foods they prefer to eat). Other agents can simply roll out the fine-tuned policy and see these personally trained behaviors. This leads to a fundamental challenge: how can we develop robots that personalize actions while keeping learning private from external agents? Here we explore this emerging topic in human-robot interaction and develop PRoP, a model-agnostic framework for personalized and private robot policies. Our core idea is to equip each user with a unique key; this key is then used to mathematically transform the weights of the robot's network. With the correct key, the robot's policy switches to match that user's preferences -- but with incorrect keys, the robot reverts to its baseline behaviors. We show the general applicability of our method across multiple model types in imitation learning, reinforcement learning, and classification tasks. PRoP is practically advantageous because it retains the architecture and behaviors of the original policy, and experimentally outperforms existing encoder-based approaches.

Paper Structure

This paper contains 12 sections, 11 equations, 6 figures.

Figures (6)

  • Figure 1: In human-robot interaction, robots are often fine-tuned to personalize to user-specific needs. The users above have different preferences encoded in their personalized datasets. When the general model is fine-tuned on a user's personalized dataset, the resulting policy is not private: anyone who interacts with the fine-tuned policy can infer that user's preferences. Instead, we propose PRoP: a method that enables private personalization of robot policies to humans. PRoP learns to associate user keys with intermediate transformations of the original policy, causing personalized and private behavior. When users do not provide a key (or provide a key not included in PRoP's training set), they receive the original policy.
  • Figure 2: Schematic diagram of PRoP. Our method for private personalization of robot policies uses a key encoder to augment the intermediate features of the neural network $\mathcal{R}_\phi$. In particular, at the intermediate layer $i$ of the original policy $\pi^\star$, we apply an affine transformation to the features $z_i$ using Equation (\ref{eq:augment}). This transformation is shown in the top row. Note that this augmentation does not need to occur at every interstitial layer of the neural network: we find in our controlled simulations that a single application of the PRoP mechanism is sufficient for personalization. When combined with a conditional, personalized loss function (shown in bottom right), we find that PRoP outperforms baseline algorithms in terms of privacy and personalization without changing the architecture of the original policy $\pi^\star$.
  • Figure 3: A depiction of our simulated environments presented in Section \ref{sec:sims}. (A) In Imitation Learning, the robot should learn to take actions in the general and personalized datasets, depending on the key. (B) In Reinforcement Learning, the robot should learn to move to reach different objects in the scene depending on the key. (C) In Image Classification, the policy learns different labels depending on the key.
  • Figure 4: Results from our controlled simulations without pretrained policies. For each experiment, the performance of the methods with respect to the general objective is shown in the first row and the performance with respect to the personalized objective is shown in the second row. (A): In Imitation Learning, the objective is to minimize mean-squared error between predicted and actual actions. (B): In Reinforcement Learning, the objective is to maximize normalized reward. (C): In Image Classification, the objective is to maximize classification rate. Each column corresponds to a different key, with left corresponding to randomly sampled keys, center to the user's key with one bit flipped, and the final column to the user's key. An asterisk indicates significance ($p < 0.05$) and an arrow indicates the desired trend.
  • Figure 5: PRoP trained across multiple users with separate objectives achieves a higher personalization rate than alternatives. An MLP outperforms the other methods in terms of fallback task performance, but is unable to personalize to specific user objectives. Note that beyond $64$ users, the $x$-axis scaling is logarithmic. The performance of PRoP is consistent until about $16$ users. After this threshold, PRoP's performance decays linearly until $512$ users. Alternatives such as an MLP or CVAE exhibit exponential performance decay with respect to the number of personalized objectives. These results indicate that PRoP can be used to compress a few users' personalizations into a shared network without leaking their information and without degrading overall task performance.
  • ...and 1 more figure
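
The Figure 2 caption describes a key-conditioned affine transformation of the intermediate features $z_i$. The exact form of that equation is not reproduced on this page, so the following is only a minimal sketch under stated assumptions: the transform is taken to be an elementwise scale and shift, the two-layer "policy" is a toy, and `ToyKeyEncoder` is a hypothetical lookup table standing in for the learned key encoder (in PRoP the mapping from keys to transformations is learned, not stored). All names and numbers are illustrative and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Key = Tuple[int, ...]


def affine_modulate(z: Sequence[float], scale: Sequence[float],
                    shift: Sequence[float]) -> Vector:
    """Elementwise affine transform z' = scale * z + shift (assumed form)."""
    if not (len(z) == len(scale) == len(shift)):
        raise ValueError("z, scale, and shift must have the same length")
    return [s * x + b for x, s, b in zip(z, scale, shift)]


class ToyKeyEncoder:
    """Hypothetical stand-in for a learned key encoder.

    Registered keys map to (scale, shift) pairs; any other key maps to the
    identity transform, so the unmodified baseline policy is recovered.
    """

    def __init__(self, dim: int, table: Dict[Key, Tuple[Vector, Vector]]):
        self.dim = dim
        self.table = dict(table)

    def __call__(self, key: Sequence[int]) -> Tuple[Vector, Vector]:
        identity = ([1.0] * self.dim, [0.0] * self.dim)
        return self.table.get(tuple(key), identity)


def hidden_layer(x: Sequence[float]) -> Vector:
    """Toy intermediate layer of the baseline policy: z = [x0 + x1, x0 - x1]."""
    return [x[0] + x[1], x[0] - x[1]]


def output_layer(z: Sequence[float]) -> float:
    """Toy output head: sum of the (possibly modulated) features."""
    return sum(z)


def policy(x: Sequence[float], key: Sequence[int],
           encoder: ToyKeyEncoder) -> float:
    """Baseline policy with one key-conditioned affine step on z."""
    z = hidden_layer(x)
    scale, shift = encoder(key)
    return output_layer(affine_modulate(z, scale, shift))


# Illustrative values only.
user_key = (1, 0, 1, 1)
flipped_key = (1, 0, 1, 0)  # the user's key with one bit flipped
encoder = ToyKeyEncoder(dim=2, table={user_key: ([2.0, 0.5], [1.0, -1.0])})
x = [3.0, 1.0]

baseline = policy(x, (), encoder)               # no key: original behavior
personalized = policy(x, user_key, encoder)     # correct key: modulated
near_miss = policy(x, flipped_key, encoder)     # wrong key: original behavior
```

In this toy, the architecture of the baseline policy is untouched: the only change is a single modulation step between two existing layers, which mirrors the caption's remark that one application of the mechanism can suffice. The hard-coded fallback to the identity transform is a simplification; in the paper, the reversion to baseline behavior for incorrect keys comes from training with the conditional loss.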