MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Yunfei Xie; Kevin Wang; Bobby Cheng; Jianzhu Yao; Zhizhou Sha; Alexander Duffy; Yihan Xi; Hongyuan Mei; Cheston Tan; Chen Wei; Pramod Viswanath; Zhangyang Wang

TL;DR

MEMO is a self-play framework that optimizes inference-time context by coupling retention and exploration. It achieves its largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.

Abstract

Multi-turn, multi-agent LLM game evaluations often exhibit substantial run-to-run variance. In long-horizon interactions, small early deviations compound across turns and are amplified by multi-agent coupling. This biases win rate estimates and makes rankings unreliable across repeated tournaments. Prompt choice worsens this further by producing different effective policies. We address both instability and underperformance with MEMO (Memory-augmented MOdel context optimization), a self-play framework that optimizes inference-time context by coupling retention and exploration. Retention maintains a persistent memory bank that stores structured insights from self-play trajectories and injects them as priors during later play. Exploration runs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit rare and decisive states. Across five text-based games, MEMO raises mean win rate from 25.1% to 49.5% for GPT-4o-mini and from 20.9% to 44.3% for Qwen-2.5-7B-Instruct, using 2,000 self-play games per task. Run-to-run variance also drops, giving more stable rankings across prompt variations. These results suggest that multi-agent LLM game performance and robustness have substantial room for improvement through context optimization. MEMO achieves the largest gains in negotiation and imperfect-information games, while RL remains more effective in perfect-information settings.
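The retention and exploration steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` and `MemoryBank` classes, the `conservative_score` rule (mean minus `k` standard deviations, the conservative estimate commonly used with TrueSkill ratings), the rating numbers, and the insight strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical candidate context with a Gaussian skill belief."""
    prompt: str
    mu: float     # estimated skill mean
    sigma: float  # uncertainty about that estimate


@dataclass
class MemoryBank:
    """Hypothetical persistent store of insights from self-play trajectories."""
    insights: list = field(default_factory=list)

    def add(self, insight: str) -> None:
        # Skip exact duplicates so the injected prior stays short.
        if insight not in self.insights:
            self.insights.append(insight)

    def as_prior(self, limit: int = 3) -> str:
        # Format the most recent `limit` insights as a bulleted prior.
        return "\n".join(f"- {text}" for text in self.insights[-limit:])


def conservative_score(c: Candidate, k: float = 3.0) -> float:
    # Assumed selection rule: penalize candidates whose skill is uncertain.
    return c.mu - k * c.sigma


def select_top(pool, n):
    # Keep the n candidates with the highest conservative score.
    return sorted(pool, key=conservative_score, reverse=True)[:n]


bank = MemoryBank()
bank.add("placeholder insight 1")
bank.add("placeholder insight 1")  # duplicate, ignored
bank.add("placeholder insight 2")

pool = [
    Candidate("prompt A", mu=27.0, sigma=4.0),  # score 27 - 12 = 15
    Candidate("prompt B", mu=25.0, sigma=1.0),  # score 25 - 3 = 22
    Candidate("prompt C", mu=30.0, sigma=8.0),  # score 30 - 24 = 6
]
survivors = select_top(pool, 2)

# The context for later play is the surviving prompt plus the stored insights.
context = survivors[0].prompt + "\n" + bank.as_prior()
```

In this toy pool, the candidate with the highest mean (`prompt C`) is dropped because its rating is too uncertain, which is the kind of behavior an uncertainty-aware selection rule is meant to produce.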

Paper Structure (60 sections, 3 equations, 27 figures, 13 tables, 2 algorithms)

Introduction
Preliminary and Problem Statement
Two-Player Multi-Turn Markov Game.
Prompt and Memory as Game Context.
Full-Context Evaluation.
The MEMO Framework
Tournament-Based Context Optimization
Context selection via game outcomes.
Context generation for the next generation.
Trajectory Reflection and Memory Bank
Trajectory reflection.
Memory bank.
Prioritized Replay
Experiment Setup
Game Environments
...and 45 more sections

Figures (27)

Figure 1: Left: Run-to-run performance and stability comparison. GPT-4o-mini with MEMO achieves the highest mean win rate (49.5%) with the lowest RSE (6.4%). Right: Learning efficiency comparison against the self-play RL baseline method Unstablebaseline. Using Qwen2.5-7B-Instruct, MEMO reaches a 60% win rate on Kuhn Poker with only 2,000 games, 19$\times$ fewer than the 38,000 games required by the RL self-play baseline.
Figure 2: Three paradigms for learning in multi-agent LLM games. (a) Prompt optimization updates the system prompt each round through self-play, but game experience is not effectively retained, so strategic insights are lost across rounds. (b) Reinforcement learning (RL) updates model weights through self-play but relies on outcome rewards, requiring large sample budgets. (c) MEMO reflects on completed trajectories and accumulates reusable insights in a persistent memory bank across generations, enabling improvement without weight updates or external reward.
Figure 3: The MEMO Framework. At each optimization generation, new candidate contexts are proposed through two strategies: random proposals and memory-augmented updates. These candidates are then evaluated via self-play, and the best-performing candidates are used to update the pool for the next generation. To encourage exploration and mitigate redundant early moves, a prioritized replay module is introduced, enabling efficient search for robust prompts and priors within a single game.
Figure 4: Transferred GPT-4o-mini context benefits weaker models uniformly but yields mixed results for stronger ones. Per-game win rates with and without the learned context for Grok-4-Fast-Non-Reasoning (left) and Gemini-2.5-Flash-Lite (right).
Figure 5: Ranking sensitivity in KuhnPoker. With environment and evaluator pools fixed, five nearly equivalent prompt variants still flip pairwise outcomes and reshuffle rankings. The heatmap shows Kendall's $\tau_b$ for every pair of prompts: blue indicates similar rankings ($\tau_b \approx 1$), white indicates unstable rankings ($\tau_b \approx 0$), and orange indicates rank reversals ($\tau_b < 0$).
...and 22 more figures
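Two statistics named in the captions above, RSE (Figure 1) and Kendall's $\tau_b$ (Figure 5), can be computed as in the following sketch. It assumes RSE means the usual relative standard error (standard error of the mean divided by the mean); the page does not define it. The win rates and rankings in the sketch are made-up inputs, not results from the paper. Only the 38,000 and 2,000 game counts come from the Figure 1 caption.

```python
import math
from itertools import combinations


def relative_standard_error(values):
    """Standard error of the mean divided by the mean (assumed RSE definition)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return math.sqrt(variance / n) / mean


def kendall_tau_b(x, y):
    """Kendall's tau-b between two paired score lists, with tie correction."""
    concordant = discordant = tied_x = tied_y = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        total += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = math.sqrt((total - tied_x) * (total - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical win rates (%) from three repeated runs of one configuration.
rse = relative_standard_error([40.0, 50.0, 60.0])

# Hypothetical model rankings induced by two prompt variants.
same = kendall_tau_b([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])      # identical rankings
reversed_ = kendall_tau_b([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])  # full reversal

# Sample-efficiency ratio using the game counts from the Figure 1 caption.
games_ratio = 38_000 / 2_000
```

Identical rankings give $\tau_b = 1$ and a full reversal gives $\tau_b = -1$, which matches the blue and orange ends of the color scale described for Figure 5.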
