Learning When to Cooperate Under Heterogeneous Goals

Max Taylor-Davies, Neil Bramley, Christopher G. Lucas

TL;DR

The paper introduces a novel approach to learning policies in an extended Ad Hoc Teamwork (AHT) setting where agents' goals may or may not overlap. The approach is based on a hierarchical combination of imitation and reinforcement learning, and it outperforms baseline methods across extended versions of two cooperative environments.

Abstract

A significant element of human cooperative intelligence lies in our ability to identify opportunities for fruitful collaboration, and conversely to recognise when the task at hand is better pursued alone. Research on flexible cooperation in machines has left this meta-level problem largely unexplored, despite its importance for successful collaboration in heterogeneous open-ended environments. Here, we extend the typical Ad Hoc Teamwork (AHT) setting to incorporate the idea of agents having heterogeneous goals that, in any given scenario, may or may not overlap. We introduce a novel approach to learning policies in this setting, based on a hierarchical combination of imitation and reinforcement learning, and show that it outperforms baseline methods across extended versions of two cooperative environments. We also investigate the contribution of an auxiliary component that learns to model teammates by predicting their actions, finding that its effect on performance is inversely related to the amount of observable information about teammate goals.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An illustration of the goal space under three different scenarios, with the set of 'worthwhile' (rewarding and in-principle achievable) goals highlighted in green. $\mathcal{G}^\text{ego}$ denotes the set of goals that would produce reward for the ego agent; $\mathcal{G}^\text{teammates}$ the set of goals that would produce reward for at least one teammate; and $\mathcal{G}^\text{solo}$ the set of goals that can be achieved through the effort of a single agent.
  • Figure 2: (A) The hierarchical architecture of GRILL. (B) The encoder-decoder architecture optimised offline in stage 1, from which the action decoder becomes the low-level policy $\pi_\text{action}$ in stage 2. (C) The auxiliary modelling component used in GRILL-M (but not GRILL).
  • Figure 3: Example frames from the two AHT environments we extend.
  • Figure 4: Top: evaluation returns relative to oracle policy, measured over 1000 episodes $\times$ the 3 scenarios $\times$ 20 independent training runs. Mean values are given by the horizontal markers, while underlaid violin plots show the KDE. Agents were trained for 1e7 and 5e7 timesteps on CR and LBF respectively. Bottom: from the same set of evaluation episodes, the difference in proportion of non-solo goals pursued between the full- and no-overlap scenarios.
  • Figure 5: The distribution of goals attempted by the ego agent during evaluation. For CR, 'attempt' means the agent occupied one of the goal tiles; for LBF, it means the agent used the 'collect' action while adjacent to a fruit.
  • ...and 2 more figures
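
The goal sets described in the Figure 1 caption can be illustrated with a small set-based sketch. This is only one possible reading, not the paper's stated definition: it assumes a goal is "in-principle achievable" when it either belongs to $\mathcal{G}^\text{solo}$ or is also rewarding for at least one teammate (who could then be expected to help). The function name, goal labels, and scenario names below are hypothetical and chosen purely for illustration.

```python
def worthwhile_goals(ego_goals, teammate_goals, solo_goals):
    """Return goals that reward the ego agent and are achievable in principle.

    Assumption (not taken from the paper): a goal counts as achievable if a
    single agent can complete it (it is in solo_goals) or if at least one
    teammate is also rewarded by it (it is in teammate_goals).
    """
    return {g for g in ego_goals if g in solo_goals or g in teammate_goals}


# Hypothetical goal labels: "a" is a solo goal, "b" and "c" need cooperation.
ego = {"a", "b", "c"}
solo = {"a"}

# Hypothetical scenarios with differing overlap between ego and teammate goals.
scenarios = {
    "full_overlap": {"a", "b", "c"},
    "partial_overlap": {"b", "d"},
    "no_overlap": {"d", "e"},
}

for name, teammates in scenarios.items():
    print(name, sorted(worthwhile_goals(ego, teammates, solo)))
```

Under this reading, the set of worthwhile goals shrinks to the solo goals as the overlap with teammates disappears, which is the situation in which pursuing the task alone is the better choice.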