Incorporating Human Flexibility through Reward Preferences in Human-AI Teaming

Siddhant Bhambri; Mudit Verma; Upasana Biswas; Anil Murthy; Subbarao Kambhampati

TL;DR

This work presents the first investigation of multi-agent PbRL. It extends single-agent PbRL to two-agent teaming settings and formulates the problem as a Human-AI PbRL Cooperation Game, in which the RL agent queries the human-in-the-loop to elicit the task objective and the human's preferences on the joint team behavior.

Abstract

Preference-based Reinforcement Learning (PbRL) has made significant strides in single-agent settings, but has not been studied for multi-agent frameworks. At the same time, modeling cooperation between multiple agents, specifically in Human-AI Teaming settings, while ensuring successful task completion is a challenging problem. To this end, we perform the first investigation of multi-agent PbRL by extending single-agent PbRL to two-agent teaming settings and formulating it as a Human-AI PbRL Cooperation Game, where the RL agent queries the human-in-the-loop to elicit the task objective and the human's preferences on the joint team behavior. Under this game formulation, we first introduce the notion of Human Flexibility to evaluate team performance based on whether humans prefer to follow a fixed policy or adapt to the RL agent on the fly. Second, we study the RL agent's varying access to the human policy. We highlight a special case along these two dimensions, which we call Specified Orchestration, where the human is least flexible and the agent has complete access to the human policy. We motivate the need for taking Human Flexibility into account and the usefulness of Specified Orchestration through a gamified user study. We evaluate state-of-the-art PbRL algorithms for Human-AI cooperative setups on robot-locomotion-based domains that explicitly require forced cooperation. Our findings highlight the challenges that arise for PbRL as Human Flexibility and the agent's access to the human policy vary. Finally, we draw insights from our user study and empirical results, and conclude that Specified Orchestration can be seen as an upper bound on PbRL performance for future research in Human-AI teaming scenarios.

Paper Structure (49 sections, 5 equations, 19 figures, 8 tables, 3 algorithms)

Introduction
Related Work
Preliminaries
PbRL for Human-AI Teaming
Human-AI PbRL Cooperation Game
Flexibility of the Human-in-the-Loop
Access to Human Policy
Specified Orchestration
Team Cooperation Human Subject Study
Motivation
Evaluation Metrics
Study Results
Adaptation and Task Success
Time Spent
Frustration and Cognitive Load
...and 34 more sections

Figures (19)

Figure 1: In this example task, a human and an AI agent need to cook a soup and deliver it to the table. The human can either continuously adapt to the AI agent, keeping an expectation of its actions, to complete the sub-task of plating the soup in a bowl efficiently, or, if possible, complete it independently of the AI agent, which is less cognitively demanding but inefficient. We aim to bridge this gap by investigating Human Flexibility in Human-AI Teaming and showcasing how PbRL can be useful.
Figure 2: MA Highway domain - the human and AI agent cars are shown one behind the other (see the domains appendix).
Figure 3: MA MuJoCo (L to R) - Cheetah, Ant, Walker, Swimmer, and Hopper. The action space for the joints in each of the domains has been split between the human and the AI agent (see the domains appendix).
Figure 4: Learning curves on MA Highway-Right (row 1) and MA MuJoCo - Cheetah (row 2): comparing (from L to R) (a) Human Flexibility on multiple $\pi_H$, with Specified Orchestration case that assumes a single $\pi_H$ and complete access to it, (b) agent's 0% access to $\pi_H$, and (c) partial access to $\pi_H$; as measured on the episodic returns. The solid lines and shaded regions represent the mean and standard deviation, respectively, across three runs.
Figure 5: Learning curves on MA MuJoCo domains (from L to R) - Walker, Hopper, Ant, Swimmer: comparing PbRL algorithms under Specified Orchestration, as measured on the episodic returns. The solid lines and shaded regions represent the mean and standard deviation, respectively, across three runs.
...and 14 more figures

Theorems & Definitions (4)

Definition 4.1
Definition 4.2
Definition 4.3
Definition 4.4
