PolicyEvol-Agent: Evolving Policy via Environment Perception and Self-Awareness with Theory of Mind
Yajie Yu, Yue Feng
TL;DR
PolicyEvol-Agent tackles policy evolution under uncertainty by integrating Theory of Mind with memoized reflection and multifaceted belief generation. The method comprises four ToM-enhanced modules (Observation Description, Policy Evolution, Environmental/Self Belief Generation, and Plan Recommendation) that iteratively calibrate behavior from observations and memories. In Leduc Hold’em, the approach outperforms RL-based baselines and a state-of-the-art ToM agent, with ablations showing planning guidance as the strongest contributor and evolution tracking as essential for adaptation. The results indicate that dynamic guideline adjustment and grounded belief synthesis enable human-like strategic behavior in dynamic, incomplete-information environments, with practical implications for adaptive, interactive agents.
Abstract
Multi-agent systems built on large language models (LLMs) have exhibited significant intelligence in real-world simulations, owing to their capabilities for social cognition and knowledge retrieval. However, research on agents equipped with effective cognition chains, including reasoning, planning, decision-making, and reflection, remains limited, especially in dynamically interactive scenarios. In addition, unlike humans, prompt-based responses struggle with psychological state perception and empirical calibration during uncertain games, which can lead to cognitive bias. In light of the above, we introduce PolicyEvol-Agent, a comprehensive LLM-empowered framework that systematically acquires the intentions of others and adaptively optimizes irrational strategies for continual improvement. Specifically, PolicyEvol-Agent first obtains reflective expertise patterns and then integrates a range of cognitive operations with Theory of Mind from both internal and external perspectives. Simulation results show that PolicyEvol-Agent outperforms RL-based models and agent-based methods in achieving final game victory. Moreover, the policy evolution mechanism demonstrates the effectiveness of dynamic guideline adjustments in both automatic and human evaluations.
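To make the module flow concrete, the following is a minimal sketch of how the four modules named in the TL;DR might be chained within a single turn. It is not the authors' implementation: the `LLM` callable, the `AgentState` fields, the prompt wording, the module ordering, and the three-entry memory window are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a text reply.
LLM = Callable[[str], str]


@dataclass
class AgentState:
    """Illustrative agent state; field names are assumptions, not from the paper."""
    policy: str = "no prior knowledge of the opponent"   # evolving guideline
    memory: List[str] = field(default_factory=list)      # past observation summaries


def play_turn(llm: LLM, state: AgentState, raw_observation: str) -> str:
    """Run one turn through the four modules and return a recommended plan."""
    # 1. Observation Description: turn the raw game state into readable text.
    description = llm(f"Describe this game state: {raw_observation}")

    # 2. Policy Evolution: revise the guideline using recent memories
    #    (a window of three entries is an arbitrary choice for this sketch).
    history = " | ".join(state.memory[-3:])
    state.policy = llm(
        f"Revise the policy. Old policy: {state.policy}. History: {history}"
    )

    # 3. Environmental / Self Belief Generation: external and internal views.
    env_belief = llm(f"Infer the opponent's intentions from: {description}")
    self_belief = llm(f"Assess our own position given: {description}")

    # 4. Plan Recommendation: combine policy and beliefs into an action plan.
    plan = llm(
        f"Policy: {state.policy}. Environment belief: {env_belief}. "
        f"Self belief: {self_belief}. Recommend an action."
    )

    # Store the description so later turns can reflect on it.
    state.memory.append(description)
    return plan
```

In this sketch each turn issues five model calls (one for the description, one for the policy update, two for beliefs, one for the plan). Passing the model in as a plain callable keeps the loop testable with a stub function in place of a real LLM.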
