Implicit Strategic Optimization: Rethinking Long-Horizon Decision-Making in Adversarial Poker Environments
Boyang Xia, Weiyou Tian, Qingnan Ren, Jiaqi Huang, Jie Xiao, Shuo Lu, Kai Wang, Lynn Ai, Eric Yang, Bill Shi
TL;DR
The paper tackles long-horizon decision-making in adversarial multi-agent settings, observing that payoffs are shaped by latent, time-evolving strategic externalities. It introduces Implicit Strategic Optimization (ISO), a prediction-aware framework that routes online learning according to private context predictions and updates context-specific learners via iso-grpo, a context-conditioned optimistic method. Theoretical results show sublinear contextual regret and convergence to (approximate) coarse correlated equilibria, with dominant terms scaling with the number of context mispredictions and the within-context variation. Empirical evaluation in 6-player No-Limit Texas Hold'em and competitive Pokémon demonstrates consistent gains in long-term return over strong LLM and RL baselines, along with graceful degradation under controlled prediction noise. The work provides a principled mechanism connecting forecast quality to long-run performance, offering a scalable path toward robust long-horizon decision-making in dynamic, strategic environments.
Abstract
Training large language model (LLM) agents for adversarial games is often driven by episodic objectives such as win rate. In long-horizon settings, however, payoffs are shaped by latent strategic externalities that evolve over time, so myopic optimization and variation-based regret analyses can become vacuous even when the dynamics are predictable. To address this, we introduce Implicit Strategic Optimization (ISO), a prediction-aware framework in which each agent forecasts the current strategic context and uses it to update its policy online. ISO combines a Strategic Reward Model (SRM) that estimates the long-run strategic value of actions with iso-grpo, a context-conditioned optimistic learning rule. We prove sublinear contextual regret and equilibrium convergence guarantees whose dominant terms scale with the number of context mispredictions; when prediction errors are bounded, our bounds recover the static-game rates obtained when strategic externalities are known. Experiments in 6-player No-Limit Texas Hold'em and competitive Pokémon show consistent improvements in long-term return over strong LLM and RL baselines, and graceful degradation under controlled prediction noise.
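To make the idea of context-routed optimistic learning concrete, the following is a minimal sketch and not the paper's iso-grpo update, which the abstract does not specify. It swaps in optimistic Hedge (exponential weights with the last reward vector as the prediction for the next round) as the per-context learner, and assumes a finite action set, full-information reward feedback, and a perfectly predicted context. The names `OptimisticHedge`, `ContextRouter`, and `run_demo` are hypothetical and introduced only for illustration.

```python
import math


class OptimisticHedge:
    """Optimistic Hedge over a finite action set (an assumed stand-in learner)."""

    def __init__(self, n_actions, lr=0.5):
        self.lr = lr
        self.cum_reward = [0.0] * n_actions   # sum of observed reward vectors
        self.last_reward = [0.0] * n_actions  # optimistic guess for the next round

    def policy(self):
        # Score each action by cumulative reward plus the predicted next reward.
        scores = [self.lr * (c + m)
                  for c, m in zip(self.cum_reward, self.last_reward)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [w / total for w in weights]

    def update(self, rewards):
        self.cum_reward = [c + r for c, r in zip(self.cum_reward, rewards)]
        self.last_reward = list(rewards)


class ContextRouter:
    """Keeps one learner per context and routes each round by the predicted context."""

    def __init__(self, n_contexts, n_actions, lr=0.5):
        self.learners = [OptimisticHedge(n_actions, lr) for _ in range(n_contexts)]

    def act(self, predicted_context):
        return self.learners[predicted_context].policy()

    def observe(self, predicted_context, rewards):
        # Only the learner selected by the prediction is updated.
        self.learners[predicted_context].update(rewards)


def run_demo(rounds=200):
    # Toy environment (assumed): two contexts alternate in blocks of 10 rounds,
    # and the rewarded action differs between them.
    router = ContextRouter(n_contexts=2, n_actions=2)
    reward_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    for t in range(rounds):
        context = (t // 10) % 2  # context predicted without error in this toy
        router.act(context)
        router.observe(context, reward_table[context])
    return router.act(0), router.act(1)
```

Calling `run_demo()` returns the two context-specific policies, each of which concentrates on the action rewarded in its own context. A mispredicted context would send the update to the wrong learner, which gives some intuition for why the stated bounds scale with the number of context mispredictions.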
