Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies

Philipp Sadler; Sherzod Hakimov; David Schlangen

TL;DR

A standard PPO setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions, and a pairing of neural partners reduces the measured joint effort when the partners play together repeatedly.

Abstract

In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players' assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. We also find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement -- which invites further research in the direction of cost-sharing in collaborative interactions.

Paper Structure (50 sections, 7 equations, 7 figures, 4 tables, 1 algorithm)

Introduction
A Game for Evaluating and Learning Collaborative Multi-Agent Policies
Actions.
Effort.
Score.
Game Instance.
Evaluation.
Learning Neural Policies for Sharing the Cost of Success
Problem Formulation
Observations
Model Architecture
Learning Algorithm
Neural and Heuristic Policies
A Neural Follower (NIF)
A Heuristic Guide (HIG)
...and 35 more sections

Figures (7)

Figure 1: A guide and a follower observe the board with the pieces and the follower's gripper (the black dot). An optimal trajectory of actions for the follower would be: up (U), up, right (R), and take (T). The best strategy for the guide presumably lies in the middle (M) of the extremes (A/B), where the guide initially refers to a piece with $l_0$ and then stays silent until confirming the follower's choice with $l_T$. This strategy shares the cost of success between both players.
Figure 2: An example from Zarrieß et al. (2016), who found that a reference game leads to diverse language production on the guide's side. To study the aspects of cost sharing in such a collaborative interaction with neural agents, we propose CoGRIP along with a generator for virtual boards that eases the application of data-driven learning methods.
Figure 3: The general information and decision-making flow during an episode of the reference game. The guide observes a constant textual target piece descriptor $l_\text{tgt}$, the partial view $p_t$ and a peripheral overview $g_t$ of the scene. Given this, the guide chooses to produce a language action $a_t$ which could mean "silence", a word, a phrase or a sentence that gets translated into an utterance $l_t$. The follower receives the utterance $l_t$, the partial view $p_t$ and a peripheral overview $f_t$. Given this, the follower performs an action $a_t$ that results in waiting, a movement (which changes the visual state) or an attempt to take a piece. The game ends when any piece is taken or the maximal number of time-steps $T_{\text{max}}$ is reached.
Figure 4: The neural agent's recurrent model architecture includes a memory mechanism (LSTM). At each time-step the observation $o_t$ is encoded and then the resulting embedding $\tilde{x}_t$ is combined with a state representation $h_{t-1}$ of previous time-steps.
Figure 5: The relative usage of utterance categories per time-step for the guide in various pairings.
...and 2 more figures
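
The turn-taking flow described in the Figure 3 caption can be sketched as a small loop. This is only an illustrative toy, not the paper's environment: the names `Board`, `run_episode`, `directive_guide` and `obedient_follower` are hypothetical, the utterances are reduced to single action words, both players see the full board instead of partial views, and no score is computed because the scoring formula is not given in this part of the text. The example replays the up, up, right, take trajectory from the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Pos = Tuple[int, int]  # (x, y); y grows downward, so "up" decreases y
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


@dataclass
class Board:
    pieces: Dict[Pos, str]  # position -> piece descriptor (hypothetical format)
    gripper: Pos
    size: int


def run_episode(
    board: Board,
    target: str,
    guide: Callable[[Board, str, int], Optional[str]],
    follower: Callable[[Board, Optional[str]], str],
    t_max: int,
) -> Tuple[bool, int, int]:
    """Play one episode; return (success, steps used, utterances produced)."""
    utterances = 0
    for t in range(t_max):
        l_t = guide(board, target, t)  # None stands for "silence"
        if l_t is not None:
            utterances += 1
        a_t = follower(board, l_t)  # "wait", a move, or "take"
        if a_t == "take":
            taken = board.pieces.get(board.gripper)
            if taken is not None:  # the game ends when any piece is taken
                return taken == target, t + 1, utterances
        elif a_t in MOVES:
            dx, dy = MOVES[a_t]
            x, y = board.gripper
            # Assumption: moves are clamped to the board borders.
            board.gripper = (
                min(max(x + dx, 0), board.size - 1),
                min(max(y + dy, 0), board.size - 1),
            )
    return False, t_max, utterances  # time limit reached without a take


def directive_guide(board: Board, target: str, t: int) -> Optional[str]:
    """Toy guide that names the next action at every step (never silent)."""
    tx, ty = next(pos for pos, desc in board.pieces.items() if desc == target)
    gx, gy = board.gripper
    if gy > ty:
        return "up"
    if gy < ty:
        return "down"
    if gx < tx:
        return "right"
    if gx > tx:
        return "left"
    return "take"


def obedient_follower(board: Board, utterance: Optional[str]) -> str:
    """Toy follower that executes the utterance and waits on silence."""
    return "wait" if utterance is None else utterance


board = Board(pieces={(2, 0): "red T", (0, 0): "blue L"}, gripper=(1, 2), size=3)
result = run_episode(board, "red T", directive_guide, obedient_follower, t_max=10)
print(result)  # (True, 4, 4): success after 4 steps with 4 utterances
```

In this toy pairing the guide speaks at every step, so the utterance count equals the step count. A guide that stays silent more often would lower the utterance count, which is the kind of effort trade-off the game is designed to measure.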
