Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search
Hao Lu, Haoyuan Huang, Yulin Zhou, Chen Li, Ningxin Zhu
TL;DR
Empirical-MCTS addresses the stateless nature of inference-time reasoning in LLMs by introducing a dual-loop framework that combines local search with global, non-parametric memory to accumulate reasoning wisdom across problems. It adds Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) as a short-term reflexive optimizer and a Memory Optimization Agent for long-term policy priors, with a hybrid value-estimation scheme that blends local pairwise preferences via a Bradley-Terry model $Q_{local}(s_c) = \frac{\exp(S_c)}{\exp(S_c) + \exp(S_p)}$ and global rankings via Enhanced Borda Count. The methodology leverages a UCB-based search strategy with a decay-based backpropagation rule $Q(s_p) \rightarrow (1-\gamma)Q(s_p) + \gamma Q(s_c)$ to propagate insights, while the memory store is updated through atomic operations Add, Modify, Merge, and Delete to form $\mathcal{D}_{t+1} = \text{Optimizer}(\mathcal{D}_t, \pi_{mem}(\mathcal{E}_{new}, \mathcal{E}_{exist}))$. Empirical results on AIME25, ARC-AGI-2, and MathArena Apex show that Empirical-MCTS surpasses stateless baselines and memory-augmented agents, achieving a new frontier in reasoning efficiency and cost-effectiveness; qualitative analyses illustrate concrete policy evolution from generic to domain-specific constraints. The work highlights the importance of memory-guided exploration for open-ended reasoning and suggests pathways for robust consistency checks and longer-horizon planning.
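As a rough illustration of the two value formulas quoted in the TL;DR, the sketch below computes the Bradley-Terry local value and applies the decay-based update along a search path. It is a minimal sketch, not the authors' implementation: the function names `q_local` and `backpropagate`, the root-to-parent ordering of the path, and the choice to pass each freshly updated value on to the next ancestor are all assumptions made for illustration.

```python
import math


def q_local(s_child: float, s_parent: float) -> float:
    """Bradley-Terry preference of child over parent:
    exp(S_c) / (exp(S_c) + exp(S_p)), i.e. sigmoid(S_c - S_p)."""
    d = s_child - s_parent
    # Branch on the sign of d so math.exp never receives a large positive value.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def backpropagate(q_path: list[float], q_child: float, gamma: float) -> list[float]:
    """Apply Q(s_p) <- (1 - gamma) * Q(s_p) + gamma * Q(s_c) up the tree.

    q_path holds the Q-values of the ancestors, ordered root first and the
    new child's direct parent last (an assumed layout). Each updated value
    is then treated as the "child" value for the next ancestor (also an
    assumption; the summary only states the one-step rule).
    """
    updated = list(q_path)
    incoming = q_child
    for i in range(len(updated) - 1, -1, -1):
        updated[i] = (1.0 - gamma) * updated[i] + gamma * incoming
        incoming = updated[i]
    return updated


if __name__ == "__main__":
    q_c = q_local(s_child=1.0, s_parent=0.0)
    print(q_c)
    print(backpropagate([0.5, 0.5], q_child=1.0, gamma=0.5))
```

With equal scores the local value is 0.5, and a larger $\gamma$ lets a new child's value override the ancestors' existing estimates more quickly.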
Abstract
Inference-time scaling strategies, particularly Monte Carlo Tree Search (MCTS), have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). However, current approaches remain predominantly stateless, discarding successful reasoning patterns after each problem instance and failing to mimic the empirical accumulation of wisdom characteristic of human problem-solving. To bridge this gap, we introduce Empirical-MCTS, a dual-loop framework that transforms stateless search into a continuous, non-parametric learning process. The framework unifies local exploration with global memory optimization through two novel mechanisms: Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) and a Memory Optimization Agent. PE-EMP functions as a reflexive optimizer within the local search, utilizing pairwise feedback to dynamically synthesize adaptive criteria and evolve meta-prompts (system prompts) in real time. Simultaneously, the Memory Optimization Agent manages a global repository as a dynamic policy prior, employing atomic operations to distill high-quality insights across problems. Extensive evaluations on complex reasoning benchmarks, including AIME25, ARC-AGI-2, and MathArena Apex, demonstrate that Empirical-MCTS significantly outperforms both stateless MCTS strategies and standalone experience-driven agents. These results underscore the necessity of coupling structured search with empirical accumulation for mastering complex, open-ended reasoning tasks.
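To make the idea of a global repository maintained through atomic operations more concrete, the following is a minimal sketch of a store that supports Add, Modify, Merge, and Delete. Everything here is hypothetical: the class `ExperienceStore`, the integer ids, the tuple encoding of operations, and the `apply_ops` helper are illustrative assumptions, and the operation list that the Memory Optimization Agent would produce is hard-coded rather than generated by a model.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceStore:
    """Hypothetical in-memory repository of textual insights, keyed by id."""
    entries: dict[int, str] = field(default_factory=dict)
    _next_id: int = 0

    def add(self, text: str) -> int:
        eid = self._next_id
        self.entries[eid] = text
        self._next_id += 1
        return eid

    def modify(self, eid: int, text: str) -> None:
        if eid not in self.entries:
            raise KeyError(eid)
        self.entries[eid] = text

    def merge(self, ids: list[int], text: str) -> int:
        # Validate first so a failed merge leaves the store unchanged.
        missing = [e for e in ids if e not in self.entries]
        if missing:
            raise KeyError(missing)
        for eid in ids:
            del self.entries[eid]
        return self.add(text)

    def delete(self, eid: int) -> None:
        del self.entries[eid]


def apply_ops(store: ExperienceStore, ops: list[tuple]) -> ExperienceStore:
    """Apply a list of (name, *args) operations in order (assumed encoding)."""
    for name, *args in ops:
        if name == "add":
            store.add(*args)
        elif name == "modify":
            store.modify(*args)
        elif name == "merge":
            store.merge(*args)
        elif name == "delete":
            store.delete(*args)
        else:
            raise ValueError(f"unknown operation: {name}")
    return store


if __name__ == "__main__":
    store = ExperienceStore()
    a = store.add("check parity before casework")
    b = store.add("check residues mod small primes")
    apply_ops(store, [("merge", [a, b], "check parity and small-prime residues first")])
    print(store.entries)
```

Keeping each operation small and explicit means one update step from $\mathcal{D}_t$ to $\mathcal{D}_{t+1}$ can be read as an ordered list of edits to the store.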
