
Next-Token Prediction and Regret Minimization

Mehryar Mohri, Clayton Sanford, Jon Schneider, Kiran Vodrahalli, Yifan Wu

Abstract

We consider the question of how to employ next-token prediction algorithms in adversarial online decision-making environments. Specifically, if we train a next-token prediction model on a distribution $\mathcal{D}$ over sequences of opponent actions, when is it the case that the induced online decision-making algorithm (by approximately best responding to the model's predictions) has low adversarial regret (i.e., when is $\mathcal{D}$ a \emph{low-regret distribution})? For unbounded context windows (where the prediction made by the model can depend on all the actions taken by the adversary thus far), we show that although not every distribution $\mathcal{D}$ is a low-regret distribution, every distribution $\mathcal{D}$ is exponentially close (in TV distance) to some low-regret distribution, and hence sublinear regret can always be achieved at negligible cost to the accuracy of the original next-token prediction model. In contrast, for bounded context windows (where the prediction made by the model can depend only on the past $w$ actions taken by the adversary, as may be the case in modern transformer architectures), we show that there are some distributions $\mathcal{D}$ of opponent play that are $\Theta(1)$-far from any low-regret distribution $\mathcal{D}'$ (even when $w = \Omega(T)$ and such low-regret distributions exist). Finally, we complement these results by showing that the unbounded-context robustification procedure can be implemented by layers of a standard transformer architecture, and provide empirical evidence that transformer models can be efficiently trained to represent these new low-regret distributions.
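To make this setup concrete, the following is a minimal Python sketch of the induced decision-making algorithm: at each round, play a best response to the model's predicted distribution over the opponent's next action, then compare the utility collected to the best fixed action in hindsight. The two-action matching-pennies-style payoff matrix and the helper names (`best_response`, `play`) are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the induced algorithm (assumed 2-action game; the
# payoff matrix and helper names are illustrative, not from the paper).
import numpy as np

# PAYOFF[a, theta] = decision maker's utility for action a against state theta.
PAYOFF = np.array([[1.0, 0.0],
                   [0.0, 1.0]])  # matching-pennies-style payoffs (hypothetical)

def best_response(pred: np.ndarray) -> int:
    """Best respond to a predicted distribution over the opponent's next state."""
    return int(np.argmax(PAYOFF @ pred))

def play(model, opponent_seq) -> float:
    """Run pi_t = BR(model(theta_{1:t-1})) and return the external regret."""
    utils, history = [], []
    for theta in opponent_seq:
        pred = model(history)          # distribution over the next opponent state
        a = best_response(pred)
        utils.append(PAYOFF[a, theta])
        history.append(int(theta))
    # External regret: best fixed action in hindsight minus realized utility.
    counts = np.bincount(np.asarray(opponent_seq, dtype=int), minlength=2)
    best_fixed = max(PAYOFF[a] @ counts for a in range(2))
    return float(best_fixed - sum(utils))
```

A `model` here is any function from an opponent history to a distribution over the opponent's next action; the question the paper studies is for which training distributions $\mathcal{D}$ this loop has sublinear regret on every sequence.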


Paper Structure

This paper contains 55 sections, 15 theorems, 81 equations, 1 figure, 2 tables, and 3 algorithms.

Key Result

Lemma 2.1

Let $D \in \Delta(\Theta^{T})$ be a distribution over sequences of $T$ states, and let $\mathcal{M}$ be a next-token prediction model that has perfectly learned the distribution $D$ (i.e., $D(\mathcal{M}) = D$). Consider the algorithm for the decision maker which sets $\pi_t = \text{BR}(\mathcal{M}(\bm{\theta}_{1:t-1}))$ at each round $t$, i.e., which best responds to the model's prediction of the next state given the states observed so far.
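For intuition about the algorithm in Lemma 2.1, here are the exact conditionals $\mathcal{M}$ of two simple distributions $D$ over $\{0,1\}^T$, written to plug into the `play` sketch above; both helpers are hypothetical illustrations, not the paper's constructions.

```python
# Exact conditional models M for two simple distributions D over {0,1}^T
# (hypothetical helpers for the `play` sketch above, not the paper's code).
import numpy as np

def bernoulli_model(p: float):
    """Conditionals of i.i.d. Ber(p): the prediction ignores the history."""
    return lambda history: np.array([1.0 - p, p])

def polya_urn_model(history) -> np.ndarray:
    """Conditionals of a Polya urn started with one ball of each color:
    each state is predicted with probability proportional to (past count + 1)."""
    counts = np.bincount(np.asarray(history, dtype=int), minlength=2) + 1.0
    return counts / counts.sum()
```

Under the matching-pennies payoffs above, `bernoulli_model(p)` makes the decision maker play a single fixed action, so an opponent who always plays the other state inflicts regret linear in $T$: perfectly learning $D$ does not by itself make $D$ a low-regret distribution. Best responding to `polya_urn_model`, by contrast, tracks the empirical counts like a smoothed follow-the-leader; whether such conditionals yield low adversarial regret is exactly the kind of question the paper's results address.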

Figures (1)

  • Figure 1: The regret of three transformers on $8$ ground-truth distributions. The three transformers are (1) ROBUST_BERNOULLI, a robustified version of BERNOULLI; (2) POLYAURN; and (3) BERNOULLI. The eight ground-truth distributions are: (a) Ber$(2/3)$ for the first half of the sequence and Ber$(1/3)$ for the second half; (b) Ber$(1/3)$ for the first half and Ber$(2/3)$ for the second half, the same distribution that ROBUST_BERNOULLI and BERNOULLI were trained on; (c) Ber$(1/3)$; (d) Ber$(2/3)$; and four periodically changing distributions on the second row. The plot shows (very narrow) confidence intervals in a lighter color.
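The ground-truth distributions in Figure 1 are straightforward to regenerate. The sketch below (illustrative names and an assumed block/period structure, not the authors' evaluation code) produces the piecewise and periodic Bernoulli sequences, which can then be fed to the `play` helper above to estimate a regret curve.

```python
# Hypothetical generators for Figure 1's ground-truth sequences over {0,1}.
import numpy as np

rng = np.random.default_rng(0)

def piecewise_bernoulli(T: int, ps) -> np.ndarray:
    """Equal-length Ber(p) blocks, e.g. ps=(2/3, 1/3) for distribution (a):
    Ber(2/3) for the first half and Ber(1/3) for the second half."""
    block = T // len(ps)
    return np.concatenate([rng.binomial(1, p, block) for p in ps])

def periodic_bernoulli(T: int, ps, period: int) -> np.ndarray:
    """Cycle through Ber(p) parameters every `period` rounds
    (the periodically changing distributions on the second row)."""
    return np.array([rng.binomial(1, ps[(t // period) % len(ps)]) for t in range(T)])
```

For example, `play(bernoulli_model(1/3), piecewise_bernoulli(1000, (2/3, 1/3)))` estimates a single point on a regret curve for distribution (a), with `bernoulli_model` standing in for a trained predictor.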

Theorems & Definitions (27)

  • Lemma 2.1
  • Lemma 2.2
  • Lemma 2.3
  • Definition 2.4: Low-Regret Model
  • Theorem 3.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 5.1
  • proof
  • proof
  • ...and 17 more