Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms

Yuchao Li, Dimitri Bertsekas

TL;DR

This work treats transformer-based $n$-gram generation as a stationary Markov chain over $n$-word states and introduces rollout algorithms as a scalable approach to produce highly likely sequences, balancing future impact with immediate probabilities. It formalizes greedy, most-likely, and rollout policies, and proves a one-step rollout improvement property while outlining several variants (simplified, truncated, and double rollout) and their computational trade-offs. Through experiments on small Markov chains and a GPT-derived chain, the paper demonstrates that rollout approaches substantially outperform greedy decoding and achieve near-optimal performance with manageable overhead, with further gains from increased lookahead and rollout iterations. The results suggest rollout methods as a practical tool for enhancing sequence generation in large-scale language models and related HMM inference tasks, while highlighting nuances in when longer lookahead helps.

Abstract

In this paper we consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT. The transformer provides next word probabilities, which can be used to generate word sequences. We consider methods for computing word sequences that are highly likely, based on these probabilities. Computing the optimal (i.e., most likely) word sequence starting with a given initial state is an intractable problem, so we propose methods to compute highly likely sequences of $N$ words in time that is a low order polynomial in $N$ and in the vocabulary size of the $n$-gram. These methods are based on the rollout approach from approximate dynamic programming, a form of single policy iteration, which can improve the performance of any given heuristic policy. In our case we use a greedy heuristic that generates as next word one that has the highest probability. We show with analysis, examples, and computational experimentation that our methods are capable of generating highly likely sequences with a modest increase in computation over the greedy heuristic. While our analysis and experiments are focused on Markov chains of the type arising in transformer and ChatGPT-like models, our methods apply to general finite-state Markov chains, and related inference applications of Hidden Markov Models (HMM), where Viterbi decoding is used extensively.
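The gap between the greedy heuristic and the most likely sequence can be seen on a toy example. The following is a minimal sketch, not the authors' code: the 3-state transition matrix `P` and the function names are hypothetical, chosen only for illustration, and the exhaustive search is feasible only because the chain is tiny.

```python
from itertools import product

# Hypothetical 3-state chain (illustrative values): P[x][y] = p(y | x).
P = [
    [0.0, 0.6, 0.4],
    [0.5, 0.3, 0.2],
    [0.1, 0.0, 0.9],
]

def sequence_probability(P, x0, seq):
    """Product of transition probabilities along x0 -> seq[0] -> seq[1] -> ..."""
    prob, x = 1.0, x0
    for y in seq:
        prob *= P[x][y]
        x = y
    return prob

def greedy_sequence(P, x0, N):
    """At each step, move to a next state of highest transition probability
    (ties go to the lowest state index)."""
    seq, x = [], x0
    for _ in range(N):
        x = max(range(len(P)), key=lambda y: P[x][y])
        seq.append(x)
    return seq

def most_likely_sequence(P, x0, N):
    """Exhaustive search over all len(P)**N sequences; toy sizes only."""
    best = max(product(range(len(P)), repeat=N),
               key=lambda seq: sequence_probability(P, x0, seq))
    return list(best)

greedy = greedy_sequence(P, 0, 3)
optimal = most_likely_sequence(P, 0, 3)
print(greedy, sequence_probability(P, 0, greedy))
print(optimal, sequence_probability(P, 0, optimal))
```

On this chain the greedy choice from state 0 takes the 0.6 transition and ends with a sequence of probability 0.18, whereas accepting the less likely first step (0.4) leads to a sequence of probability 0.324. Exhaustive search grows exponentially in $N$, which is the kind of cost the rollout approach is meant to avoid.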

Paper Structure (8 sections, 25 equations, 11 figures)

Figures (11)

  • Figure 1: Schematic visualization of an $n$-gram. Given the ($n$-word) text string $x_k$ generated at time $k$, it generates the next ($n$-word) text string $x_{k+1}$ by adding a word at the front end of $x_k$, and deleting the word at the back end of $x_k$.
  • Figure 2: Illustration of the state trajectory generated by a policy $\pi$, starting at state $x$ at time $k$. The probability of its occurrence, $P_k(x,\pi)$, is the product of the transition probabilities along the $N-k$ steps of the trajectory [cf. Eq. (\ref{multrule})].
  • Figure 3: Schematic illustration of the rollout policy with one-step lookahead. At the current state $x_k$, we compute the Q-factors $Q_{\overline{\pi},k}(x_k,y)=p(y\mid x_k)P_{k+1}(y,\overline{\pi})$ by running the greedy selection policy from all possible next states $y$. We then select as next state $x_{k+1}$ the one with maximal Q-factor.
  • Figure 4: Illustration of $\ell$-step lookahead rollout with $\ell=2$. At the current state $x_k$, we maximize over all pairs $\{y_1,y_2\}$, the $\ell$-step Q-factor $Q_{\overline{\pi},k,\ell}(x_k,y_1,y_2)=p(y_1\mid x_k)p(y_2\mid y_1)P_{k+\ell}(y_2,\overline{\pi})$; cf. Eq. (\ref{elllook}); the figure illustrates the case $\ell=2$. If $\{\tilde{y}_1,\tilde{y}_2\}$ is the maximizing sequence, we select $\tilde{y}_1$ and discard $\tilde{y}_2$.
  • Figure 5: A two-state Markov chain example with transition probabilities as shown next to the transition arcs (the transition not shown in the graph has probability 0). We assume that $x_0=1$, $p>1/2$, and $N$ is even.
  • ...and 6 more figures
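
The one-step lookahead rollout described in the Figure 3 caption can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation: the transition matrix `P` is a hypothetical 3-state chain, the function names are made up for this example, and the greedy policy plays the role of the base policy $\overline{\pi}$. At each step the code evaluates $Q_{\overline{\pi},k}(x_k,y)=p(y\mid x_k)P_{k+1}(y,\overline{\pi})$ for every candidate next state $y$ and moves to a maximizer.

```python
# Hypothetical 3-state chain (illustrative values): P[x][y] = p(y | x).
P = [
    [0.0, 0.6, 0.4],
    [0.5, 0.3, 0.2],
    [0.1, 0.0, 0.9],
]

def greedy_tail_probability(P, y, steps):
    """Probability of the greedy trajectory of `steps` transitions starting at y,
    i.e. the factor P_{k+1}(y, greedy) in the Q-factor."""
    prob, x = 1.0, y
    for _ in range(steps):
        nxt = max(range(len(P)), key=lambda z: P[x][z])
        prob *= P[x][nxt]
        x = nxt
    return prob

def rollout_sequence(P, x0, N):
    """One-step lookahead rollout with the greedy policy as base policy."""
    seq, x = [], x0
    for k in range(N):
        remaining = N - k - 1  # transitions left after moving to y

        def q_factor(y):
            # Immediate transition probability times greedy continuation.
            return P[x][y] * greedy_tail_probability(P, y, remaining)

        x = max(range(len(P)), key=q_factor)  # ties go to the lowest index
        seq.append(x)
    return seq

print(rollout_sequence(P, 0, 3))        # rollout trajectory from state 0
print(greedy_tail_probability(P, 0, 3))  # probability of the pure greedy trajectory
```

On this toy chain the rollout policy picks the lower-probability first transition because the greedy continuation from that state is much more likely, which is the trade-off between immediate probability and future impact that the Q-factor captures. Each step runs one greedy simulation per candidate next state, so the cost is roughly the number of states times the cost of a greedy run.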