Fast Peer Adaptation with Context-aware Exploration
Long Ma, Yuanfei Wang, Fangwei Zhong, Song-Chun Zhu, Yizhou Wang
TL;DR
The paper tackles fast adaptation to unknown peers in partially observable, long-horizon multi-agent games by introducing PACE, a framework that combines a context-aware policy with a peer-identification auxiliary task and a mutual-information-inspired exploration mechanism. Trained with PPO over multiple episodes against a diverse pool of training peers, PACE learns to probe peer strategies, build an informative context, and exploit that context with a best response once it is confident. Empirical results on Kuhn Poker, PO-Overcooked, and Predator-Prey-W show that PACE adapts faster and achieves higher returns than strong baselines, remains robust to sudden peer changes, and learns informative latent representations of peers. The approach improves the robustness and efficiency of multi-agent interaction in competitive, cooperative, and mixed settings, with potential implications for human-agent collaboration and adversarial scenarios.
Abstract
Adapting quickly to unknown peers (partners or opponents) with different strategies is a key challenge in multi-agent games. To do so, the agent must probe and identify the peer's strategy efficiently, since identification is the prerequisite for carrying out the best response during adaptation. However, exploring the strategies of unknown peers is difficult, especially when the games are partially observable and have a long horizon. In this paper, we propose a peer identification reward, which rewards the learning agent based on how well it can identify the behavior pattern of the peer from the historical context, such as observations collected over multiple episodes. This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation, i.e., to actively seek and collect informative feedback from peers when uncertain about their policies and to exploit the context to perform the best response when confident. We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
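To make the idea of a peer identification reward concrete, the following is a minimal sketch of one plausible form, not the exact formulation used in the paper. It assumes a hypothetical identifier that maps the accumulated context to one logit per peer in the training pool, and it takes the reward to be the log-probability assigned to the true peer. The function names and the logit values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def peer_identification_reward(logits, true_peer_id):
    """Log-probability that the (hypothetical) identifier assigns to the true peer.

    `logits` stands in for the identifier's output given the current context;
    the reward is at most 0 and rises as the context becomes more informative.
    """
    probs = softmax(logits)
    return math.log(probs[true_peer_id])


# Assumed identifier outputs over a pool of three training peers.
# Early context: the identifier cannot tell the peers apart.
uncertain = peer_identification_reward([0.0, 0.0, 0.0], true_peer_id=1)
# After informative probing: the logits favor the true peer.
confident = peer_identification_reward([0.0, 4.0, 0.0], true_peer_id=1)

print(uncertain, confident)
```

Under this assumed form, actions that make the peer reveal its behavior pattern raise the reward, which is the incentive for active exploration described above. Once the identifier is already confident, little identification reward remains to be gained, so the task return dominates the agent's objective.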
