Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning

Hao Ma, Shijie Wang, Zhiqiang Pu, Siyao Zhao, Xiaolin Ai

TL;DR

This work tackles aligning multi-agent reinforcement learning policies with human common sense in complex, sparse-reward tasks. It introduces V-GEPF, a hierarchical framework that combines a VLM-based generic potential function at the bottom with a vLLM-based adaptive skill selector at the top to shape rewards without changing the optimal policy, via the potential-based form $F(s,s'|l)=\gamma\phi(s'|l)-\phi(s|l)$ and $R(s_t,s_{t+1})=r_{env}(s_t,s_{t+1})+\rho F(s_t,s_{t+1}|l)$. The approach leverages CLIP embeddings for cross-modal state-instruction similarity and employs a vLLM to adaptively switch among multiple potential functions, with theoretical guarantees of policy invariance and Nash equilibrium preservation. Empirically, on Google Research Football, V-GEPF outperforms state-of-the-art baselines in final win rate and produces more human-like coordination, demonstrating practical potential for human-aligned MARL in real-world domains.
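
The two formulas above can be illustrated with a minimal sketch. It assumes the potential $\phi(s|l)$ is the cosine similarity between a state-image embedding and an instruction embedding; the toy vectors stand in for CLIP outputs, and the function names and the values of `gamma` and `rho` are hypothetical, not taken from the paper.

```python
import numpy as np

def cosine_potential(state_emb, instr_emb):
    """phi(s|l): cosine similarity between state and instruction embeddings (assumed form)."""
    s = np.asarray(state_emb, dtype=float)
    l = np.asarray(instr_emb, dtype=float)
    return float(s @ l / (np.linalg.norm(s) * np.linalg.norm(l)))

def shaping_term(phi_s, phi_next, gamma):
    """F(s, s'|l) = gamma * phi(s'|l) - phi(s|l)."""
    return gamma * phi_next - phi_s

def shaped_reward(r_env, phi_s, phi_next, gamma=0.99, rho=0.1):
    """R(s_t, s_{t+1}) = r_env + rho * F(s_t, s_{t+1}|l); gamma and rho are illustrative values."""
    return r_env + rho * shaping_term(phi_s, phi_next, gamma)

# Toy embeddings standing in for encoder outputs (hypothetical data).
instr = np.array([1.0, 0.0])
s_t = np.array([0.0, 1.0])      # orthogonal to the instruction: phi = 0
s_next = np.array([1.0, 0.0])   # aligned with the instruction: phi = 1

phi_t = cosine_potential(s_t, instr)
phi_next = cosine_potential(s_next, instr)
reward = shaped_reward(r_env=0.0, phi_s=phi_t, phi_next=phi_next)
print(phi_t, phi_next, reward)
```

In this toy transition the environment reward is zero, so the whole signal comes from the shaping term: moving toward a state that matches the instruction yields a positive shaped reward even under sparse environment rewards.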

Abstract

Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. In addition, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
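
The claim that the optimal policy is preserved can be made concrete with the standard telescoping argument for potential-based shaping. This is a sketch, not the paper's own proof: it assumes the instruction $l$ is held fixed along a trajectory $s_0, \dots, s_T$ and uses the shaping term $F(s,s'|l)=\gamma\phi(s'|l)-\phi(s|l)$ with weight $\rho$.

$$
\sum_{t=0}^{T-1}\gamma^{t}F(s_t,s_{t+1}|l)
=\sum_{t=0}^{T-1}\left(\gamma^{t+1}\phi(s_{t+1}|l)-\gamma^{t}\phi(s_t|l)\right)
=\gamma^{T}\phi(s_T|l)-\phi(s_0|l)
$$

Under these assumptions, the shaped return differs from the environment return only by $\rho\left(\gamma^{T}\phi(s_T|l)-\phi(s_0|l)\right)$, which depends on the start and end states rather than on the intermediate states or actions. The second term is fixed by the initial state, and the first vanishes as $T$ grows when $\gamma<1$ and $\phi$ is bounded, so the ranking of policies by return is unchanged.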

Paper Structure

This paper contains 35 sections, 24 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Multi-agent policies trained by MAPPO and HAPPO in the GRF 11 vs 11 scenario. (a) MAPPO: The non-ball-holding players of the yellow team do not form a reasonable formation to occupy valuable space. (b) HAPPO: The non-ball-holding players of the yellow team lack meaningful positioning and movement to provide effective passing opportunities for the ball-holding player.
  • Figure 2: Framework of V-GEPF. The state's image representation $s_t^G$ and the human instructions $l$ are input separately into an image encoder and a text encoder. The cosine distance computed from the outputs is designed as a generic potential function $\phi(s_t|l)$, which is weighted and combined with the original environmental rewards to guide the learning of policies. Furthermore, the video replay of the last episode, the initial human instructions, information about the potential function pool, and the reflection on the records of the last potential function are fed into the vLLM to adaptively select an appropriate generic potential function at various training phases.
  • Figure 3: Average win rate curves during training in GRF 11 vs 11 scenario.
  • Figure 4: Comparison of policy styles trained with MAPPO and MAPPO enhanced by V-GEPF.
  • Figure 5: Potential function curves during training. Six VLM-based potential functions are selected sequentially by a vLLM according to replayed videos.
  • ...and 7 more figures

Theorems & Definitions (3)

  • Definition 1
  • proof
  • proof