Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Chanwoo Park, Xiangyu Liu, Asuman Ozdaglar, Kaiqing Zhang

TL;DR

This work uses regret as the central metric to quantify how large language model (LLM) agents perform in online decision-making and multi-agent games. It provides empirical evidence that representative LLMs often exhibit sublinear regret in non-stationary online learning and in repeated games, and it offers theoretical insights linking pre-training data distributions to no-regret behavior via follow-the-perturbed-leader (FTPL). The authors introduce regret-loss, an unsupervised objective that promotes no-regret behavior without optimal-action labels, and prove generalization and optimization guarantees for it, including connections to follow-the-regularized-leader (FTRL). They also identify simple counterexamples where advanced LLMs such as GPT-4 fail to be no-regret, and demonstrate that Transformers trained with regret-loss achieve sublinear regret on these cases. Collectively, the work advances principled evaluation and training methods for LLM agents in online and strategic settings, with implications for robust, equilibrium-aware decision-making in real-world deployments.

Abstract

Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote the no-regret behaviors, we propose a novel unsupervised training loss of regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. We then establish the statistical guarantee of generalization bound for regret-loss minimization, followed by the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above "regrettable" cases.
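
To make the regret metric concrete, the following is a minimal sketch that computes the regret of a learner against the best fixed action in hindsight, under full-information feedback. It assumes a finite action set with losses in $[0, 1]$ and uses Hedge (multiplicative weights, a standard instance of FTRL with an entropic regularizer) as the learner. The function name `hedge_regret` and the step size are illustrative choices, not taken from the paper.

```python
import math
import random


def hedge_regret(losses, eta):
    """Run Hedge on a T x d table of losses and return its expected regret.

    Regret = (learner's cumulative expected loss)
             - (cumulative loss of the best fixed action in hindsight).
    """
    d = len(losses[0])
    cum = [0.0] * d  # cumulative loss of each action over past rounds
    learner_loss = 0.0
    for loss in losses:
        # The policy depends only on past losses; shift by the minimum
        # before exponentiating for numerical stability.
        m = min(cum)
        weights = [math.exp(-eta * (c - m)) for c in cum]
        total = sum(weights)
        policy = [w / total for w in weights]
        learner_loss += sum(p * l for p, l in zip(policy, loss))
        # Full-information feedback: the whole loss vector is revealed.
        cum = [c + l for c, l in zip(cum, loss)]
    return learner_loss - min(cum)


if __name__ == "__main__":
    rng = random.Random(0)
    T, d = 200, 3
    losses = [[rng.random() for _ in range(d)] for _ in range(T)]
    eta = math.sqrt(8 * math.log(d) / T)  # standard tuning for losses in [0, 1]
    print(hedge_regret(losses, eta))
```

With this tuning, the standard Hedge analysis bounds the regret by $\sqrt{T \ln(d) / 2}$, which grows sublinearly in $T$; a learner is called no-regret when its regret is sublinear in this sense.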

Paper Structure (116 sections, 26 theorems, 157 equations, 25 figures, 3 tables)

Key Result

Proposition 1

($p$-value of the null hypothesis). For the event $\mathcal{E}(s, T)$ (whose definition is not reproduced here), under the assumption that the null hypothesis $H_0$ holds, the probability of this event happening is bounded as $\mathbb{P}_{H_0}(\mathcal{E}(s, T)) \leq \frac{1}{2^{T-1}} \sum_{t = s}^{T-1} \binom{T-1}{t}$.
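
As a rough illustration, the bound can be evaluated numerically by reading it as a binomial tail, that is, with summand $\binom{T-1}{t}$ (an assumption about the formula, since the summand is not legible in the statement as extracted). The function name `trend_p_value_bound` is hypothetical.

```python
from math import comb


def trend_p_value_bound(s, T):
    """Evaluate (1 / 2^(T-1)) * sum_{t=s}^{T-1} C(T-1, t).

    Assumes the summand in the proposition is the binomial coefficient
    C(T-1, t); this is a reading of the bound, not a confirmed formula.
    """
    if not 0 <= s <= T - 1:
        raise ValueError("s must satisfy 0 <= s <= T - 1")
    tail = sum(comb(T - 1, t) for t in range(s, T))
    return tail / 2 ** (T - 1)


if __name__ == "__main__":
    # With T = 5 and s = 4, only the term C(4, 4) = 1 remains: 1 / 16.
    print(trend_p_value_bound(4, 5))  # 0.0625
```

Under this reading, the bound shrinks quickly as $s$ approaches $T-1$, so a large value of $s$ would give strong evidence against $H_0$.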

Figures (25)

  • Figure 3.1: Demonstration of the prompts and interaction protocol for multi-player repeated games. A human moderator does not provide the game's payoff matrices to the LLMs. Instead, at each round, the human moderator provides each player's own payoff vector history.
  • Figure 3.2: Regret of pre-trained LLMs for online learning with full-information feedback. Notably, both commercial and open-source LLMs can achieve sublinear regret as validated by our frameworks and the comparison with FTRL/FTPL, though the performance of the weaker models (GPT-3.5 and the open-source ones) is worse. Interestingly, the GPT-4 model can even outperform the well-known no-regret learning algorithms FTRL and FTPL.
  • Figure 3.3: Regret of pre-trained LLMs for online learning with full-information feedback, with longer horizons of $T=100$ and $T=200$. In most cases, the LLMs can achieve sublinear regret as validated by our frameworks and the comparison with FTRL/FTPL, though the performance of the weaker model, GPT-3.5, is worse.
  • Figure 3.4: Regret of pre-trained LLMs for repeated games of different sizes. In most cases, both commercial and open-source LLMs can achieve sublinear regret as validated by our frameworks and the comparison with FTRL/FTPL. We report the regret of one agent for ease of presentation.
  • Figure 3.5: (left) Regret of GPT-4 (Turbo) under the canonical counterexample for FTL (Hazan, 2016). (middle, right) Failure of GPT-4 (Turbo) on two scenarios with regrettable behaviors, while Transformers trained by our new regret-loss ($N=1$), described in the trained-Transformer section, can achieve sublinear regret.
  • ...and 20 more figures

Theorems & Definitions (41)

  • Proposition 1
  • Definition 4.1: Quantal response against multiple losses
  • Theorem 4.1: Informal: Emergence of no-regret behavior
  • Theorem 5.1
  • Theorem 5.2
  • Proposition 1
  • Definition D.1: Quantal response
  • Example 1: An example instantiating the decomposition assumption
  • Lemma 1
  • Theorem D.1
  • ...and 31 more