Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search

Hao Lu, Haoyuan Huang, Yulin Zhou, Chen Li, Ningxin Zhu

TL;DR

Empirical-MCTS addresses the stateless nature of inference-time reasoning in LLMs by introducing a dual-loop framework that combines local search with global, non-parametric memory to accumulate reasoning wisdom across problems. It adds Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) as a short-term reflexive optimizer and a Memory Optimization Agent for long-term policy priors, with a hybrid value-estimation scheme that blends local pairwise preferences via a Bradley-Terry model $Q_{local}(s_c) = \frac{\exp(S_c)}{\exp(S_c) + \exp(S_p)}$ and global rankings via Enhanced Borda Count. The methodology leverages a UCB-based search strategy with a decay-based backpropagation rule $Q(s_p) \rightarrow (1-\gamma)Q(s_p) + \gamma Q(s_c)$ to propagate insights, while the memory store is updated through atomic operations Add, Modify, Merge, and Delete to form $\mathcal{D}_{t+1} = \text{Optimizer}(\mathcal{D}_t, \pi_{mem}(\mathcal{E}_{new}, \mathcal{E}_{exist}))$. Empirical results on AIME25, ARC-AGI-2, and MathArena Apex show that Empirical-MCTS surpasses stateless baselines and memory-augmented agents, achieving a new frontier in reasoning efficiency and cost-effectiveness; qualitative analyses illustrate concrete policy evolution from generic to domain-specific constraints. The work highlights the importance of memory-guided exploration for open-ended reasoning and suggests pathways for robust consistency checks and longer-horizon planning.
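
The two formulas quoted in the TL;DR can be illustrated with a minimal Python sketch. It assumes that $S_c$ and $S_p$ are scalar scores for a child node and its parent and that $\gamma$ lies in $[0, 1]$; the function names `q_local` and `backpropagate` are hypothetical and are not taken from the paper, and the Enhanced Borda Count term of the hybrid value estimate is not modeled here.

```python
import math


def q_local(s_child: float, s_parent: float) -> float:
    """Bradley-Terry preference of the child over its parent:
    exp(S_c) / (exp(S_c) + exp(S_p))."""
    # Subtracting the larger score before exponentiating leaves the
    # ratio unchanged and avoids overflow for large scores.
    m = max(s_child, s_parent)
    e_c = math.exp(s_child - m)
    e_p = math.exp(s_parent - m)
    return e_c / (e_c + e_p)


def backpropagate(q_parent: float, q_child: float, gamma: float) -> float:
    """Decay-based update: (1 - gamma) * Q(s_p) + gamma * Q(s_c)."""
    return (1.0 - gamma) * q_parent + gamma * q_child


# Hypothetical scores, chosen only to exercise the two functions.
child_value = q_local(2.0, 0.0)
parent_value = backpropagate(0.5, child_value, gamma=0.25)
```

When the two scores are equal, `q_local` returns 0.5, so the child is treated as no better or worse than its parent. With `gamma = 0` the parent value is left unchanged, and with `gamma = 1` it is overwritten by the child value.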

Abstract

Inference-time scaling strategies, particularly Monte Carlo Tree Search (MCTS), have significantly enhanced the reasoning capabilities of Large Language Models (LLMs). However, current approaches remain predominantly stateless, discarding successful reasoning patterns after each problem instance and failing to mimic the empirical accumulation of wisdom characteristic of human problem-solving. To bridge this gap, we introduce Empirical-MCTS, a dual-loop framework that transforms stateless search into a continuous, non-parametric learning process. The framework unifies local exploration with global memory optimization through two novel mechanisms: Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP) and a Memory Optimization Agent. PE-EMP functions as a reflexive optimizer within the local search, utilizing pairwise feedback to dynamically synthesize adaptive criteria and evolve meta-prompts (system prompts) in real-time. Simultaneously, the Memory Optimization Agent manages a global repository as a dynamic policy prior, employing atomic operations to distill high-quality insights across problems. Extensive evaluations on complex reasoning benchmarks, including AIME25, ARC-AGI-2, and MathArena Apex, demonstrate that Empirical-MCTS significantly outperforms both stateless MCTS strategies and standalone experience-driven agents. These results underscore the critical necessity of coupling structured search with empirical accumulation for mastering complex, open-ended reasoning tasks.
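
To make the idea of atomic memory operations concrete, the following is a minimal Python sketch of a repository that supports Add, Modify, Merge, and Delete. Everything about the data layout is an assumption made for illustration: experiences are stored as plain strings keyed by integer ids, and the class `MemoryStore` and the function `apply_operations` are hypothetical names. In the paper, the choice of which operations to apply is made by the Memory Optimization Agent; here the operation list is simply passed in.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Hypothetical experience repository: integer id -> experience text."""
    entries: dict = field(default_factory=dict)
    next_id: int = 0

    def add(self, text: str) -> int:
        """Store a new experience and return its id."""
        entry_id = self.next_id
        self.entries[entry_id] = text
        self.next_id += 1
        return entry_id

    def modify(self, entry_id: int, text: str) -> None:
        """Rewrite an existing experience in place."""
        if entry_id not in self.entries:
            raise KeyError(entry_id)
        self.entries[entry_id] = text

    def merge(self, entry_ids: list, text: str) -> int:
        """Replace several experiences with one consolidated entry."""
        for entry_id in entry_ids:
            del self.entries[entry_id]
        return self.add(text)

    def delete(self, entry_id: int) -> None:
        """Drop an experience that is no longer useful."""
        del self.entries[entry_id]


ALLOWED = {"add", "modify", "merge", "delete"}


def apply_operations(store: MemoryStore, operations: list) -> MemoryStore:
    """Apply a list of (operation_name, args) pairs to the store in order."""
    for name, args in operations:
        if name not in ALLOWED:
            raise ValueError(f"unknown operation: {name}")
        getattr(store, name)(*args)
    return store
```

Restricting updates to these four operations keeps each change to the repository small and easy to audit, which is one plausible reason to prefer them over rewriting the whole memory after every problem.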

Paper Structure

This paper contains 17 sections, 2 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Concrete Instantiation of Empirical-MCTS Framework.
  • Figure 2: Cost-Performance Pareto Frontier Analysis. We plot accuracy against inference cost for various models on MathArena Apex and ARC-AGI-2. The dashed blue line indicates the Pareto frontier, representing the optimal trade-off between cost and performance.
  • Figure 3: Ablation study on AIME25 using gpt-oss-120b.
  • Figure 4: Growth of empirical memory repository during search. The number of distilled experiences increases monotonically with rollout count (0 $\rightarrow$ 8), enabling progressive policy refinement across problem instances.
  • Figure 5: Performance vs. accumulated experiences on AIME25. The full framework (red) demonstrates a strong positive correlation between experience count and accuracy, while ablated variants (yellow) show limited improvement. Baseline performance (56.7%) is shown as a horizontal reference line.