PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of Mind

Yajie Yu, Yue Feng

TL;DR

PolicyEvol-Agent tackles policy evolution under uncertainty by integrating Theory of Mind with memoized reflection and multifaceted belief generation. The method comprises four ToM-enhanced modules (Observation Description, Policy Evolution, Environmental/Self Belief Generation, and Plan Recommendation) that iteratively calibrate behavior from observations and memories. In Leduc Hold’em, the approach outperforms RL-based baselines and a state-of-the-art ToM agent, with ablations showing planning guidance as the strongest contributor and evolution tracking as essential for adaptation. The results indicate that dynamic guideline adjustment and grounded belief synthesis enable human-like strategic behavior in dynamic, incomplete-information environments, with practical implications for adaptive, interactive agents.

Abstract

Multi-agent systems built on large language models (LLMs) have exhibited significant intelligence in real-world simulations, owing to their capabilities in social cognition and knowledge retrieval. However, existing research on agents equipped with effective cognition chains, including reasoning, planning, decision-making, and reflection, remains limited, especially in dynamically interactive scenarios. In addition, unlike humans, prompt-based responses face challenges in psychological state perception and empirical calibration during uncertain gaming processes, which can lead to cognitive bias. In light of the above, we introduce PolicyEvol-Agent, a comprehensive LLM-empowered framework characterized by systematically acquiring the intentions of others and adaptively optimizing irrational strategies for continual enhancement. Specifically, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind alongside internal and external perspectives. Simulation results, in which PolicyEvol-Agent outperforms RL-based models and agent-based methods, demonstrate its superiority in final game outcomes. Moreover, the policy evolution mechanism reveals the effectiveness of dynamic guideline adjustments in both automatic and human evaluation.

Paper Structure

This paper contains 28 sections, 14 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Examples of PolicyEvol-Agent's cognitive process when reacting to an opponent's raise action. The top part shows the agent making a wrong decision before its policy has evolved, while the bottom part illustrates the reasoning, planning, and decision-making that result from the calibrated policy. The middle part shows the process of policy evolution.
  • Figure 2: Illustration of PolicyEvol-Agent with its four modules detailed. Each module is shown with its cognitive operations and an example output.
  • Figure 3: Chip gains for every ten games during the evolution process. Left: PolicyEvol-Agent vs. Suspicion-Agent. Right: PolicyEvol-Agent vs. CFR. The average and median chip gains are shown as the orange line and gray boxes, respectively.
  • Figure 4: Proportion of different actions taken by PolicyEvol-Agent in the small blind and big blind positions over 50 games against Suspicion-Agent.