Learning Communication Policies for Different Follower Behaviors in a Collaborative Reference Game

Philipp Sadler; Sherzod Hakimov; David Schlangen

TL;DR

This work investigates how neural Guide agents can learn to adapt their communication policies to different Follower behaviors in a collaborative reference game. The authors frame the interaction as a reinforcement learning problem, introduce an effort-aware reward, and train a policy over intent-based language actions with PPO. The resulting strategies are less verbose and adapt to the Follower's autonomy and confidence. The learned policies reach high task success while staying silent in some steps, the Guide often uses reference utterances to steer the Follower's plan, and the Follower's behavior shapes the Guide's communication patterns. The findings advance the understanding of adaptable, human-ready communication in cooperative AI and point to future work on more nuanced reward structures and incremental language production.

Abstract

Albrecht and Stone (2018) state that modeling of changing behaviors remains an open problem "due to the essentially unconstrained nature of what other agents may do". In this work, we evaluate the adaptability of neural artificial agents towards assumed partner behaviors in a collaborative reference game. In this game, success is achieved when a knowledgeable Guide can verbally lead a Follower to the selection of a specific puzzle piece among several distractors. We frame this language grounding and coordination task as a reinforcement learning problem and measure to what extent a common reinforcement learning algorithm (PPO) is able to produce neural agents (the Guides) that perform well with various heuristic Follower behaviors that vary along the dimensions of confidence and autonomy. We experiment with a learning signal that, in addition to the goal condition, also respects an assumed communicative effort. Our results indicate that this novel ingredient leads to communicative strategies that are less verbose (staying silent in some of the steps) and that, in this respect, the Guide's strategies indeed adapt to the partner's level of confidence and autonomy.
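To make the idea of an effort-aware learning signal concrete, the following is a minimal sketch, not the paper's reward. The function name `step_reward`, the parameters `goal_bonus` and `effort_cost`, and their values are all assumptions chosen for illustration; the abstract only states that the signal combines the goal condition with an assumed communicative effort.

```python
def step_reward(goal_reached: bool, spoke: bool,
                goal_bonus: float = 1.0, effort_cost: float = 0.1) -> float:
    """Hypothetical per-step reward: goal term minus a cost for speaking.

    All names and constants here are illustrative assumptions,
    not values taken from the paper.
    """
    reward = goal_bonus if goal_reached else 0.0
    if spoke:
        # Producing an utterance is assumed to cost communicative effort.
        reward -= effort_cost
    return reward
```

Under such a signal, staying silent is preferred whenever an utterance is not expected to improve the chance of reaching the goal, which is one way a less verbose strategy could emerge.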

Paper Structure (40 sections, 6 equations, 6 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Vision and language navigation.
Natural language goals in RL.
Interactive sub-goal generation in RL.
Skill learning in cooperative multi-agent RL.
The CoGRIP-GL Reference Game
Problem Formulation
Actions
Verbalization
Rewards
Observations
Tasks
The Follower Behaviors
Confidence.
...and 25 more sections

Figures (6)

Figure 1: An exemplary interaction between a Guide and a Follower that controls the gripper (the black dot). The Guide observes the scene $v_0$ and initially refers to a piece with $l_0$. The Follower has only a partial view $p_0$ (the grey box) and might go wrong. The Guide can provide further information based on the Follower's actions until a piece is selected at time step $T$. The Guide should learn that fewer utterances are necessary with a more autonomous and confident Follower.
Figure 2: The general information and decision-making flow of the reference game. The Guide observes $v_t$ which contains the full scene in pixel space and additionally the gripper position (4th-channel) and target piece (5th-channel). Given this, the Guide chooses an intent action $a_t$ that gets verbalized into a natural language sentence $l_t$. Then, the Follower receives the utterance $l_t$, the gripper coordinate $g_t$ and a symbolic representation of a partial view of the scene $p_t$. The hand-crafted policy updates the plan accordingly based on its given representation of the world. Finally, the Follower's next planned action (or wait) is performed with a certain chance defined by the attached confidence. The process repeats until a piece is taken or time runs out.
Figure 3: The Guide's recurrent vision network.
Figure 4: An intent's mean chance of being chosen at a step (for all learnt policies evaluated on the test split).
Figure 5: The distribution of the preference order choices for the reference action (from Figure 4). The preferences over position (P), shape (S) and color (C) are given to the IA for reference production.
...and 1 more figures
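
The decision-making flow described in the Figure 2 caption can be sketched as a simple loop. This is a rough illustration under stated assumptions, not the authors' implementation: the function names (`follower_step`, `run_episode`), the callables passed in, the action labels `"take"` and `"wait"`, and the reading that a planned action is executed with probability equal to the confidence (and the Follower waits otherwise) are all hypothetical. Observations such as $v_t$, $g_t$ and $p_t$ are omitted for brevity.

```python
import random

def follower_step(planned_action, confidence, rng):
    # Assumption: the planned action is executed with probability
    # `confidence`; otherwise the Follower waits for this step.
    return planned_action if rng.random() < confidence else "wait"

def run_episode(choose_intent, verbalize, update_plan,
                confidence, max_steps, seed=0):
    """Guide intent -> utterance -> Follower plan update -> stochastic action."""
    rng = random.Random(seed)
    log, plan = [], None
    for t in range(max_steps):
        intent = choose_intent(t, log)       # Guide picks an intent (may be silence)
        utterance = verbalize(intent)        # intent a_t -> sentence l_t, or None
        plan = update_plan(plan, utterance)  # hand-crafted Follower policy
        action = follower_step(plan, confidence, rng)
        log.append((intent, utterance, action))
        if action == "take":                 # episode ends once a piece is taken
            break
    return log                               # otherwise time runs out
```

For example, a Guide that refers once and then stays silent, paired with a fully confident Follower that plans to take the piece as soon as it hears an utterance, finishes in a single step:

```python
guide = lambda t, log: "reference" if t == 0 else "silence"
speak = lambda intent: "Take the red piece" if intent == "reference" else None
plan = lambda old, utterance: "take" if utterance else old
episode = run_episode(guide, speak, plan, confidence=1.0, max_steps=5)
```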
