Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction

Shaunak A. Mehta; Dylan P. Losey

TL;DR

The paper addresses learning manipulation tasks from physical human interaction without predefining task-specific features. It introduces an end-to-end reward-learning framework that unifies demonstrations, corrections, and preferences: the robot trains an ensemble of neural reward models, then converts the learned reward into a robot trajectory through constrained optimization. Human input can be collected passively or actively. The key contributions are sampling of alternative trajectories through trajectory deformation, cross-entropy losses that cover multiple feedback modalities, an information-theoretic active query mechanism, and trajectory optimization that respects robot kinematics. In simulations and a user study, the proposed method outperforms end-to-end baselines, particularly on new or unexpected tasks, and performs comparably to feature-based methods when the features are known. The work shows that robot arms can be taught manipulation tasks through physical interaction without task priors.

Abstract

Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human's intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU
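The comparison idea in the abstract (score the human's input above nearby alternatives) can be illustrated with a softmax cross-entropy loss. The sketch below is a minimal illustration and not the paper's loss, which is not reproduced in this summary. The linear-tanh reward model, the function names, and the toy trajectories are all assumptions made for the example.

```python
import numpy as np

def trajectory_reward(theta, xi):
    """Hypothetical reward model: sum of tanh(waypoint . theta) over the trajectory.

    xi is a (T, d) array of waypoints and theta is a (d,) weight vector.
    A neural network would normally take the place of this function.
    """
    return float(np.sum(np.tanh(xi @ theta)))

def comparison_loss(theta, xi_human, alternatives):
    """Cross-entropy loss that treats the human's trajectory as the correct choice."""
    scores = np.array(
        [trajectory_reward(theta, xi_human)]
        + [trajectory_reward(theta, xi) for xi in alternatives]
    )
    scores = scores - scores.max()  # shift for numerical stability
    log_prob_human = scores[0] - np.log(np.sum(np.exp(scores)))
    return -log_prob_human

def ensemble_loss(thetas, xi_human, alternatives):
    """Average the comparison loss over every member of the (assumed) ensemble."""
    return float(np.mean([comparison_loss(t, xi_human, alternatives) for t in thetas]))

# Toy usage: one human trajectory and two alternatives, all 5 waypoints in 2-D.
xi_human = np.ones((5, 2))
alternatives = [-np.ones((5, 2)), np.zeros((5, 2))]
thetas = [np.array([1.0, 1.0]), np.array([0.5, 2.0])]
loss = ensemble_loss(thetas, xi_human, alternatives)
```

Minimizing this loss pushes each ensemble member to assign the human's trajectory a higher reward than the alternatives. With uninformative weights (all zeros) the loss equals the log of the number of candidates.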

Paper Structure (15 sections, 14 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Related Work
Problem Statement
Preliminaries: Learning Rewards with Known Features
Problem: Learning Arbitrary Rewards from Physical Interaction
Unifying Demonstrations, Corrections, and Preferences
Learning the Reward Model
Optimizing for Robot Trajectories
Simulation 1: Learning from Multiple Forms of Interaction
User Study: Multiple Forms of Physical Interaction
Simulation 2: Learning with Known and Unknown Features
Learning from Demonstrations and Corrections
Learning from Demonstrations and Preferences
Conclusion
Acknowledgements

Figures (8)

Figure 1: Human teaching a robot arm to assemble a chair. The robot does not have any prior information about this task, and must learn from the human's physical interactions. We recognize that these interactions can take multiple different forms, including demonstrations, corrections, and preferences. To unify each type of input under a single framework, we train a reward model to assign higher scores to the human's behavior ($\xi_R$) than to nearby alternatives ($\xi_A$). The robot then optimizes this reward model to find its desired trajectory.
Figure 2: Different types of physical feedback. (Left) Humans can convey information to robot arms by kinesthetically guiding the robot through a demonstration of the task. Demonstrations provide high-level information about the entire trajectory. (Middle) To refine a specific part of the robot's motion humans may make physical corrections. These corrections fine-tune the robot's behavior. (Right) Over repeated interactions the human will observe multiple robot trajectories. Humans can rank these trajectories (i.e., give their preferences) to indicate when the robot is making a mistake. We note that preferences are not physical --- in the sense that the human does not apply forces or torques --- but preference feedback naturally emerges when humans and robots occupy the same space and the human can physically observe the robot's behavior.
Figure 3: Generating trajectories for comparison. In this example the human moves a $2$-DoF point mass robot along a sine wave. We record the initial trajectory $\xi$, and then apply Equation (\ref{eq:M1}) to generate smooth perturbations $\hat{\xi}$. Our learned reward model should score $\xi$ as a better trajectory than any of the alternatives $\hat{\xi}$.
Figure 4: Experimental results for simulated humans paired with a Franka Emika robot arm. (Left) We compare different versions of our approach to state-of-the-art end-to-end learning baselines as well as a feature-based approach that combines multiple forms of feedback. (Center) $15$ simulated humans perform each task (Laptop and Table) using all the end-to-end learning algorithms. (Right) The simulated humans perform each task with a feature-based learning algorithm. We record the performance of the robot after learning from each approach in the form of regret and report the average regret and standard error. Ours (DCP) significantly outperforms all other versions of our approach ($p<.05$). Ours (DCP) has a significantly lower average regret than the end-to-end learning methods ($p<.05$) and performs on par with RRIC. We emphasize that RRIC has access to all relevant features in the environment, while Ours learns the reward function from scratch.
Figure 5: Learned trajectories and objective results from our in-person user study. (Top) Participants physically interacted with a $7$-DoF robot arm that had no prior knowledge about the tasks. The robot learned from physical interactions using our approach and imitation learning baselines that combine multiple feedback modalities. (Middle) The final trajectories the robot learned with each method. Five users taught the robot the Table task, five users taught the Proximity task, and five users taught the Cup task. During each task the robot needed to reach a goal position within the white rectangle. We trace the $xyz$ position of the robot's end-effector; within the Cup task the robot also needed to maintain specific orientations. (Bottom) The regret between the robot's learned trajectory and ideal trajectory. Lower values of regret indicate that the robot completed the task correctly, and the error bars plot standard error of the mean. Ours outperforms AIRL and Atari on the Table and Cup tasks ($p<.05$), and Ours has a lower regret than all the baselines for the Proximity task ($p<.05$).
...and 3 more figures
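The Figure 3 caption describes generating smooth perturbations of a recorded trajectory for comparison. The equation it refers to is not reproduced in this summary, so the sketch below substitutes a simple stand-in: a sin² bump added over a randomly chosen segment. The function names, the bump shape, and the parameter values are assumptions for illustration only.

```python
import numpy as np

def deform(xi, start, length, push):
    """Return a copy of xi with a smooth bump added over waypoints [start, start + length)."""
    xi_hat = np.array(xi, dtype=float, copy=True)
    end = min(start + length, len(xi_hat))
    n = end - start
    # sin^2 profile: zero at the segment edges, one in the middle
    profile = np.sin(np.linspace(0.0, np.pi, n)) ** 2
    xi_hat[start:end] += np.outer(profile, push)
    return xi_hat

def sample_alternatives(xi, num_samples, length, scale, rng):
    """Sample perturbed copies of xi; the first and last waypoints are left untouched."""
    alternatives = []
    for _ in range(num_samples):
        start = int(rng.integers(1, len(xi) - length))
        push = rng.normal(0.0, scale, size=xi.shape[1])  # random push direction and size
        alternatives.append(deform(xi, start, length, push))
    return alternatives

# Toy usage mirroring the caption: a 2-DoF point mass moving along a sine wave.
t = np.linspace(0.0, 2.0 * np.pi, 50)
xi = np.stack([t, np.sin(t)], axis=1)
rng = np.random.default_rng(0)
alternatives = sample_alternatives(xi, num_samples=5, length=15, scale=0.3, rng=rng)
```

Each alternative stays close to the recorded trajectory and shares its start and end points, so a reward model trained to prefer the original must pick up on local differences in shape.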
