Scaling Inference-Time Computation via Opponent Simulation: Enabling Online Strategic Adaptation in Repeated Negotiation

Xiangyu Liu, Di Wang, Zhe Feng, Aranyak Mehta

TL;DR

Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that embedding a classical game-theoretical learning dynamic into LLM inference yields significant performance improvement over the course of repeated online interaction, relative to various baselines. The result is a scalable and principled approach to repeated strategic decision-making without any parameter updates.

Abstract

While large language models (LLMs) have emerged as powerful decision-makers across a wide range of single-agent and stationary environments, fewer efforts have been devoted to settings where LLMs must engage in repeated and strategic interactions with unknown or dynamic opponents. In such settings, recipes built upon offline pre-training or fine-tuning, though robust against worst-case adversaries, do not fully exploit the capability of LLMs to adapt online based on interaction feedback. Instead, we explore the more natural perspective of scaling inference-time computation as a mechanism for adaptation, embedding the principles of a classical game-theoretical learning dynamic, smooth Fictitious Play (sFP), into LLM inference: (i) for belief formation, we employ an auxiliary opponent model that in-context learns to imitate the time-averaged behavior of the opponent; (ii) for best response, we advance best-of-$N$ (BoN) sampling by simulating against the opponent model. Empirical evaluations on two distinct forms of repeated negotiation games demonstrate that our method enables significant performance improvement over repeated online interaction compared to various baselines, offering a scalable and principled approach to repeated strategic decision-making without any parameter updates.
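For readers unfamiliar with smooth Fictitious Play, the following is a minimal sketch of the classical dynamic in a small matrix game, not the LLM-based method of the paper. It shows the two ingredients the abstract names: a belief formed from the opponent's time-averaged behavior, and a smoothed (softmax) best response to that belief. The payoff matrix, the action history, and the `temperature` value are illustrative assumptions.

```python
import math

def empirical_belief(opponent_history, num_actions):
    """Time-averaged opponent behavior; uniform when there is no history yet."""
    if not opponent_history:
        return [1.0 / num_actions] * num_actions
    counts = [0] * num_actions
    for action in opponent_history:
        counts[action] += 1
    return [c / len(opponent_history) for c in counts]

def smooth_best_response(payoff, belief, temperature=0.1):
    """Softmax over expected payoffs under the belief (a smoothed best response)."""
    expected = [sum(u * b for u, b in zip(row, belief)) for row in payoff]
    top = max(expected)  # subtract the max for numerical stability
    weights = [math.exp((u - top) / temperature) for u in expected]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical payoffs for the row player: PAYOFF[my_action][opponent_action].
PAYOFF = [[3.0, 0.0],
          [1.0, 2.0]]

# Hypothetical record of the opponent's past actions.
belief = empirical_belief([1, 1, 0, 1], num_actions=2)
policy = smooth_best_response(PAYOFF, belief)
```

As `temperature` approaches zero the response approaches an exact best response to the empirical belief; larger values keep the policy closer to uniform.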

Paper Structure (37 sections, 3 theorems, 12 equations, 17 figures, 3 tables, 1 algorithm)


Key Result

Proposition 4.1

For both of our negotiation games, in a single episode, there does not exist a policy $\pi_1^\star\in\Pi_1$ such that for any $\pi_2\in\Pi_2$, it holds $J_1(\pi_1^\star, \pi_2)=\max_{\pi_1\in\Pi_1}J_1(\pi_1, \pi_2)$. In fact, for any $\pi_1^\star\in\Pi_1$, there exists $\pi_2\in\Pi_2$ such that $J_1

Figures (17)

  • Figure 1: Overview of our proposed strategic decision-making framework for repeated interactions. At step $h$ of an ongoing episode $t$, the seller agent engages in Internal Strategic Thinking. First, an in-context opponent model $\pi_2^{oppo}$ is constructed using the interaction history to summarize the buyer's behavioral patterns. Then, the seller performs strategic brainstorming to generate diverse candidate strategies (e.g., Tit-for-tat, fair split). During opponent simulation, the seller rolls out future trajectories by predicting the buyer's responses via $\pi_2^{oppo}$. Finally, the agent evaluates the simulated rewards, and executes the best candidate action.
  • Figure 2: The pairwise normalized rewards among the $7$ kinds of prompts for the buyer-seller negotiation games. Results are shown for both buyers and sellers, each when starting first and when starting second.
  • Figure 3: Correlation between the average normalized reward in the first $5$ episodes and the last $5$ episodes for buyer-seller negotiation games. Results are shown for all $7\times 7$ different prompt pairs.
  • Figure 4: Comparison of our method (red line) with $5$ baselines introduced in \ref{sec:inf}.
  • Figure 5: Comparison of buyer's performance under two seller behavior settings.
  • ...and 12 more figures
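The loop described in the Figure 1 caption (build an opponent model from history, propose candidates, simulate each against the model, execute the best one) can be sketched with a toy stand-in. This is an illustration under strong assumptions, not the authors' implementation: the LLM-based opponent model is replaced by a smoothed frequency estimate, the brainstormed strategies by a fixed list of prices, and the multi-step rollout by a single accept/reject step. All names, the history, and the `SELLER_COST` value are hypothetical.

```python
import random

SELLER_COST = 20.0  # hypothetical production cost for the seller

def build_opponent_model(history):
    """Return a function: price -> estimated buyer acceptance probability.

    history is a list of (offered_price, accepted) pairs from past episodes.
    """
    def accept_prob(price):
        # An accepted offer at a higher-or-equal price suggests acceptance here;
        # a rejected offer at a lower-or-equal price suggests rejection here.
        yes = sum(1 for p, ok in history if ok and p >= price)
        no = sum(1 for p, ok in history if not ok and p <= price)
        return (yes + 1) / (yes + no + 2)  # Laplace smoothing
    return accept_prob

def simulate_reward(price, accept_prob, rng):
    """One simulated rollout: the seller earns the margin only on acceptance."""
    return price - SELLER_COST if rng.random() < accept_prob(price) else 0.0

def best_of_n(candidates, accept_prob, rollouts=2000, seed=0):
    """Score each candidate by its average simulated reward; return the best."""
    rng = random.Random(seed)
    def score(price):
        total = sum(simulate_reward(price, accept_prob, rng) for _ in range(rollouts))
        return total / rollouts
    return max(candidates, key=score)

# Hypothetical interaction history and candidate prices.
history = [(80, False), (70, False), (60, True), (50, True)]
opponent = build_opponent_model(history)
best = best_of_n([45, 60, 75, 90], opponent)
```

In this toy setting the selected price trades off margin against the estimated chance of acceptance; replacing `build_opponent_model` and the candidate list with model-generated counterparts is where inference-time computation would scale.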

Theorems & Definitions (4)

  • Proposition 4.1
  • Proposition 4.2
  • Remark 4.3: Connections to online adaptation
  • Theorem 4.4