Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms
Yuchao Li, Dimitri Bertsekas
TL;DR
This work models transformer-based $n$-gram generation as a stationary Markov chain over $n$-word states and introduces rollout algorithms as a scalable way to produce highly likely sequences, weighing each next word's immediate probability against its effect on the likelihood of the rest of the sequence. It formalizes the greedy, most likely, and rollout policies, proves an improvement property for one-step rollout, and outlines several variants (simplified, truncated, and double rollout) together with their computational trade-offs. In experiments on small Markov chains and on a GPT-derived chain, rollout substantially outperforms greedy decoding and achieves near-optimal performance with manageable overhead, with further gains from longer lookahead and additional rollout iterations. The results suggest that rollout methods are a practical tool for improving sequence generation in large-scale language models and in related HMM inference tasks, while noting that longer lookahead does not help equally in every case.
Abstract
In this paper, we consider a transformer with an $n$-gram structure, such as the one underlying ChatGPT. The transformer provides next-word probabilities, which can be used to generate word sequences. We consider methods for computing word sequences that are highly likely, based on these probabilities. Computing the optimal (i.e., most likely) word sequence starting from a given initial state is an intractable problem, so we propose methods that compute highly likely sequences of $N$ words in time that is a low-order polynomial in $N$ and in the vocabulary size of the $n$-gram. These methods are based on the rollout approach from approximate dynamic programming, a form of single policy iteration, which can improve the performance of any given heuristic policy. In our case, we use a greedy heuristic that selects as the next word one with the highest probability. We show with analysis, examples, and computational experimentation that our methods are capable of generating highly likely sequences with a modest increase in computation over the greedy heuristic. While our analysis and experiments focus on Markov chains of the type arising in transformer and ChatGPT-like models, our methods apply to general finite-state Markov chains and to related inference applications of hidden Markov models (HMMs), where Viterbi decoding is used extensively.
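The difference between the greedy heuristic and one-step rollout with a greedy base policy can be illustrated on a tiny finite-state Markov chain. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names `greedy_path` and `rollout_path`, the dense row-stochastic list-of-lists `P`, and the three-state example chain are all hypothetical choices made for illustration. At each step, rollout scores every feasible next state by its transition probability multiplied by the probability of the greedy completion from that state, and moves to the best-scoring one.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # assumed row-stochastic: P[x][y] = Prob(next = y | current = x)


def greedy_path(P: Matrix, x: int, steps: int) -> Tuple[List[int], float]:
    """Greedy heuristic: repeatedly move to a most probable next state.

    Ties are broken by lowest state index (the behavior of max over range).
    Returns the visited states (excluding the start) and the path probability.
    """
    path: List[int] = []
    prob = 1.0
    for _ in range(steps):
        y = max(range(len(P)), key=lambda j: P[x][j])
        prob *= P[x][y]
        path.append(y)
        x = y
    return path, prob


def rollout_path(P: Matrix, x: int, steps: int) -> Tuple[List[int], float]:
    """One-step rollout that uses the greedy heuristic as its base policy."""
    path: List[int] = []
    prob = 1.0
    for k in range(steps):
        remaining = steps - k - 1  # transitions left after the one chosen now
        best_y, best_score = -1, -1.0
        for y in range(len(P)):
            if P[x][y] == 0.0:
                continue  # infeasible transition
            # Score = immediate probability times the greedy completion from y.
            _, tail_prob = greedy_path(P, y, remaining)
            score = P[x][y] * tail_prob
            if score > best_score:
                best_y, best_score = y, score
        prob *= P[x][best_y]
        path.append(best_y)
        x = best_y
    return path, prob


if __name__ == "__main__":
    # Hypothetical chain: state 1 looks attractive from state 0 but leads to
    # low-probability continuations, while state 2 is absorbing.
    P = [
        [0.0, 0.6, 0.4],
        [0.4, 0.3, 0.3],
        [0.0, 0.0, 1.0],
    ]
    print("greedy :", greedy_path(P, 0, 3))
    print("rollout:", rollout_path(P, 0, 3))
```

On this example chain, with three transitions starting from state 0, greedy follows 1, 0, 1 with probability of about 0.144, whereas rollout follows 2, 2, 2 with probability 0.4, which is also the most likely sequence here. The sketch calls the greedy heuristic once per candidate next state at every step, so its cost is roughly the greedy cost multiplied by the number of steps and the number of states. This is in keeping with the polynomial overhead described above, though the variants mentioned in the TL;DR (simplified, truncated, and double rollout) are not reproduced here.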
