Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning
Hao Ma, Shijie Wang, Zhiqiang Pu, Siyao Zhao, Xiaolin Ai
TL;DR
This work tackles aligning multi-agent reinforcement learning (MARL) policies with human common sense in complex, sparse-reward tasks. It introduces V-GEPF, a hierarchical framework that combines a generic potential function based on a visual-language model (VLM) at the bottom layer with an adaptive skill selector based on a visual large language model (vLLM) at the top layer. Rewards are shaped without changing the optimal policy through the potential-based form $F(s,s'|l)=\gamma\phi(s'|l)-\phi(s|l)$, which gives the training reward $R(s_t,s_{t+1})=r_{env}(s_t,s_{t+1})+\rho F(s_t,s_{t+1}|l)$. The approach uses CLIP embeddings to measure cross-modal similarity between states and instructions, and uses the vLLM to switch adaptively among multiple potential functions, with theoretical guarantees of policy invariance and Nash equilibrium preservation. Empirically, on Google Research Football, V-GEPF outperforms state-of-the-art baselines in final win rate and produces more human-like coordination, demonstrating practical potential for human-aligned MARL in real-world domains.
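The shaping formulas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the embedding vectors are plain lists standing in for CLIP image and text embeddings, and treating $l$ as the language instruction and the potential $\phi(s|l)$ as a cosine similarity is an assumption based on the description of cross-modal state-instruction similarity. The last helper checks the standard telescoping property of potential-based shaping: the discounted sum of $F$ along a trajectory depends only on the first and last potentials, which is the intuition behind policy invariance.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def potential(state_emb: Sequence[float], instr_emb: Sequence[float]) -> float:
    """Assumed potential phi(s | l): similarity of state and instruction embeddings."""
    return cosine_similarity(state_emb, instr_emb)


def shaping_term(phi_s: float, phi_next: float, gamma: float) -> float:
    """F(s, s' | l) = gamma * phi(s' | l) - phi(s | l)."""
    return gamma * phi_next - phi_s


def shaped_reward(r_env: float, phi_s: float, phi_next: float,
                  gamma: float = 0.99, rho: float = 1.0) -> float:
    """R(s_t, s_{t+1}) = r_env(s_t, s_{t+1}) + rho * F(s_t, s_{t+1} | l)."""
    return r_env + rho * shaping_term(phi_s, phi_next, gamma)


def discounted_shaping_sum(phis: Sequence[float], gamma: float) -> float:
    """Sum over t of gamma^t * F(s_t, s_{t+1} | l) along one trajectory.

    The sum telescopes to gamma^T * phi(s_T) - phi(s_0), so it does not
    depend on the intermediate states visited.
    """
    return sum(
        (gamma ** t) * shaping_term(phis[t], phis[t + 1], gamma)
        for t in range(len(phis) - 1)
    )


if __name__ == "__main__":
    # Toy embeddings (hypothetical values, not real CLIP outputs).
    instruction = [1.0, 0.0]
    s_t, s_next = [0.2, 1.0], [0.9, 0.3]
    phi_t = potential(s_t, instruction)
    phi_next = potential(s_next, instruction)
    print(shaped_reward(0.0, phi_t, phi_next, gamma=0.99, rho=0.5))
```

In this toy run the second state is more similar to the instruction than the first, so the shaping term is positive even though the environment reward is zero, which is how the shaping signal can densify a sparse reward.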
Abstract
Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive to build and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. In addition, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
