Predictive Auxiliary Learning for Belief-based Multi-Agent Systems
Qinwei Huang, Stefan Wang, Simon Khan, Garrett Katz, Qinru Qiu
TL;DR
This work tackles instability and inefficiency in belief-based MARL under partial observability by introducing BEPAL-MAS, which augments agents with a belief decoder and auxiliary predictive tasks to anticipate unobservable information such as teammates' rewards and motions. The method combines a Graph Attention Encoder, LSTM-based hidden state updates, and a centralized loss $L = L_{RL} + \lambda L_{aux}$, where $L_{aux}$ consists of MSE terms predicting $\overline{b^t}$ and $\overline{p^t}$. Experimental results on Predator-Prey and Google Research Football show that BEPAL improves average performance by about 16% and yields more stable convergence, with ablations confirming the positive impact of each auxiliary task and correlations between auxiliary accuracy and RL performance. The work also demonstrates transferability and manageable computational overhead, suggesting practical applicability for scalable, cooperative MARL in complex, partially observable domains. These findings indicate that belief-based auxiliary supervision can robustly enhance policy learning and coordination in multi-agent systems with imperfect information.
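To make the combined objective concrete, the following is a minimal sketch of how a loss of the form $L = L_{RL} + \lambda L_{aux}$ might be assembled. It is not the authors' implementation. The function names, the equal weighting of the two MSE terms, and the default value of `lam` are assumptions made for illustration; `b_target` and `p_target` are hypothetical stand-ins for the prediction targets $\overline{b^t}$ and $\overline{p^t}$.

```python
def mse(pred, target):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(rl_loss, b_pred, b_target, p_pred, p_target, lam=0.5):
    """Combine an RL loss with an auxiliary loss: L = L_RL + lam * L_aux.

    Assumption: L_aux is the unweighted sum of the two MSE terms. The source
    only states that L_aux consists of MSE terms for the two targets.
    Returns the total loss and the auxiliary loss.
    """
    aux = mse(b_pred, b_target) + mse(p_pred, p_target)
    return rl_loss + lam * aux, aux


# Toy values, chosen only to exercise the arithmetic.
loss, aux = total_loss(
    rl_loss=1.0,
    b_pred=[0.0, 1.0], b_target=[0.0, 0.0],
    p_pred=[1.0, 1.0], p_target=[1.0, 3.0],
    lam=0.5,
)
print(loss, aux)
```

Setting `lam=0` in this sketch recovers the plain RL loss, which is one way to view a no-auxiliary baseline.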
Abstract
The performance of multi-agent reinforcement learning (MARL) in partially observable environments depends on effectively aggregating information from observations, communications, and reward signals. While most existing multi-agent systems primarily rely on rewards as the only feedback for policy training, our research shows that introducing auxiliary predictive tasks can significantly enhance learning efficiency and stability. We propose Belief-based Predictive Auxiliary Learning (BEPAL), a framework that incorporates auxiliary training objectives to support policy optimization. BEPAL follows the centralized training with decentralized execution paradigm. Each agent learns a belief model that predicts unobservable state information, such as other agents' rewards or motion directions, alongside its policy model. By enriching hidden state representations with information that does not directly contribute to immediate reward maximization, this auxiliary learning process stabilizes MARL training and improves overall performance. We evaluate BEPAL in the predator-prey environment and Google Research Football, where it achieves an average improvement of about 16 percent in performance metrics and demonstrates more stable convergence compared to baseline methods.
