Fast Peer Adaptation with Context-aware Exploration

Long Ma, Yuanfei Wang, Fangwei Zhong, Song-Chun Zhu, Yizhou Wang

TL;DR

The paper tackles fast adaptation to unknown peers in partially observable, long-horizon multi-agent games by introducing PACE, a framework that combines a context-aware policy with a peer-identification auxiliary task and a mutual-information-inspired exploration mechanism. Trained with PPO over multiple episodes against a diverse pool of training peers, PACE learns to probe peer strategies, build an informative context, and exploit that context with a best response once it is confident. Empirical results on Kuhn Poker, PO-Overcooked, and Predator-Prey-W show that PACE adapts faster and achieves higher returns than strong baselines, stays robust to sudden peer changes, and learns informative latent representations of peers. The approach improves the robustness and efficiency of multi-agent interaction in competitive, cooperative, and mixed settings, with potential implications for human-agent collaboration and adversarial scenarios.

Abstract

Adapting quickly to unknown peers (partners or opponents) with different strategies is a key challenge in multi-agent games. To do so, the agent must probe and identify the peer's strategy efficiently, as this is the prerequisite for carrying out the best response during adaptation. However, exploring the strategies of unknown peers is difficult, especially when the games are partially observable and have a long horizon. In this paper, we propose a peer identification reward, which rewards the learning agent based on how well it can identify the behavior pattern of the peer from the historical context, such as observations over multiple episodes. This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation, i.e., to actively seek and collect informative feedback from peers when uncertain about their policies and to exploit the context to perform the best response when confident. We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
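
To make the idea of a peer identification reward concrete, here is a minimal sketch, not the paper's exact formulation. It assumes a hypothetical classifier that maps the interaction context to one logit per peer in the training pool, and it scores the agent by the log-probability assigned to the true peer relative to a uniform guess. The function names `softmax` and `peer_identification_reward` are illustrative only.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def peer_identification_reward(logits, true_peer_index):
    """Hypothetical reward: log-probability of the true peer minus that of a uniform guess.

    `logits` are assumed to come from a context encoder plus classifier head
    (not shown). The reward is 0 when the context is uninformative and
    approaches log(num_peers) when the peer is identified with certainty.
    """
    probs = softmax(logits)
    num_peers = len(logits)
    return math.log(probs[true_peer_index]) + math.log(num_peers)


# Uninformative context: no better than guessing, so the reward is about 0.
print(peer_identification_reward([0.0, 0.0, 0.0], true_peer_index=1))

# Informative context that favors the true peer: positive reward.
print(peer_identification_reward([5.0, 0.0, 0.0], true_peer_index=0))
```

Averaged over peers and contexts, this quantity is a standard variational lower bound on the mutual information between peer identity and context under a uniform prior over peers, which is one way to read the "mutual-information-inspired" exploration signal. How the paper weights or schedules such a reward against the task reward is not specified in this summary.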

Paper Structure

This paper contains 35 sections, 10 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: An example of fast peer adaptation with experiences from online interaction, where a mother uses her prior experiences with her baby as contextual cues to decide which item to offer and to further probe the baby's preferences. In the initial encounter, having observed the baby's disinterest in the milk bottle, the mother infers that the baby is not hungry and offers a toy as an alternative. Although the baby initially responds unfavorably to the teddy bear, its reaction discernibly improves, ultimately leading the mother to successfully choose a toy car in their third interaction.
  • Figure 2: Illustration of PACE. The ego agent (left) is trained against a diverse pool of peers (right) during training. Conditioned on the past episodes, the ego agent proposes new actions to explore the peer or exploit the best response. The peer identification objective backpropagates to the context encoder and generates exploration reward for the policy to maximize mutual information.
  • Figure 3: Illustrations of Kuhn Poker (a), PO-Overcooked (b), and Predator-Prey-W (c). In (a), the hand of the peer agent is only revealed at showdowns (blue diamond nodes); in (b), the masked gray area indicates the area unobserved by the ego agent (the agent in the left room); in (c), the ego predator has full observability only while in contact with the watchtowers (blue circles).
  • Figure 4: The online adaptation results on Kuhn Poker (a), PO-Overcooked (b), and Predator-Prey-W (c). PACE continuously improves over the whole online adaptation process, outperforming baselines in all environments. In particular, PACE is the only agent capable of adaptation in the PO-Overcooked environment. Oracle denotes the best responses designed separately for every peer in the test pool.
  • Figure 5: The t-SNE plot of the latent embeddings produced by the PACE (a) and LILI (b) encoder in PO-Overcooked. Each color indicates a specific testing peer, while the shades of color denote the time order during adaptation.
  • ...and 4 more figures