Understanding the Training and Generalization of Pretrained Transformer for Sequential Decision Making

Hanzhao Wang; Yu Pan; Fupeng Sun; Shang Liu; Kalyan Talluri; Guanting Chen; Xiaocheng Li

TL;DR

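This paper identifies an out-of-distribution issue in the supervised pre-training of transformers for sequential decision making and proposes a natural solution: including the transformer-generated action sequences in the training procedure, which yields better properties both numerically and theoretically.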

Abstract

In this paper, we consider the supervised pre-trained transformer for a class of sequential decision-making problems. The class of considered problems is a subset of the general formulation of reinforcement learning in that there is no transition probability matrix; though seemingly restrictive, this subclass covers bandits, dynamic pricing, and newsvendor problems as special cases. Such a structure enables the use of optimal actions/decisions in the pre-training phase, which in turn provides new insights into the training and generalization of the pre-trained transformer. We first note that the training of the transformer model can be viewed as a performative prediction problem, and that existing methods and theories largely ignore or cannot resolve an out-of-distribution issue. We propose a natural solution that includes the transformer-generated action sequences in the training procedure, and it enjoys better properties both numerically and theoretically. The availability of the optimal actions in the considered tasks also allows us to analyze the properties of the pre-trained transformer as an algorithm, and explains why it may lack exploration and how this can be automatically resolved. Numerically, we categorize the advantages of pre-trained transformers over structured algorithms such as UCB and Thompson sampling into three cases: (i) it better utilizes the prior knowledge in the pre-training data; (ii) it can elegantly handle the misspecification issue suffered by the structured algorithms; (iii) for a short time horizon such as $T \le 50$, it behaves more greedily and achieves much lower regret than the structured algorithms designed for asymptotic optimality.
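To make the comparison with structured algorithms concrete, the following is a minimal sketch of a UCB1-style index policy on a Bernoulli multi-armed bandit, tracking cumulative expected regret over a short horizon. It is not the paper's benchmark code: the function name `run_index_policy`, the arm means, and the `bonus_scale` parameter are hypothetical and introduced only for illustration. Setting `bonus_scale=0.0` removes the exploration bonus and gives a purely greedy rule, which is one simple way to see how greedier behavior can affect regret when $T$ is small.

```python
import math
import random


def run_index_policy(means, horizon, rng, bonus_scale=1.0):
    """Run a UCB1-style policy on a Bernoulli bandit.

    means       : true success probabilities of the arms (hypothetical values)
    horizon     : number of rounds T
    rng         : a random.Random instance
    bonus_scale : 1.0 gives the UCB1 bonus, 0.0 gives a greedy rule (assumption)

    Returns the cumulative expected regret after each round.
    """
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best = max(means)
    cumulative, regret = 0.0, []

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # pull every arm once before using the index
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + bonus_scale * math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        cumulative += best - means[arm]   # expected (pseudo-)regret of this pull
        regret.append(cumulative)
    return regret


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]
    ucb = run_index_policy(arm_means, 50, random.Random(0))
    greedy = run_index_policy(arm_means, 50, random.Random(0), bonus_scale=0.0)
    print("UCB1 regret at T=50:  ", round(ucb[-1], 3))
    print("Greedy regret at T=50:", round(greedy[-1], 3))
```

A single run like this is noisy; any comparison between the two settings should be averaged over many seeds and environments before drawing conclusions.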

Paper Structure

This paper contains 51 sections, 5 theorems, 61 equations, 18 figures, and 2 algorithms.

Introduction
Problem Setup
Environment and performance metrics
Supervised pre-training
Pre-training and Generalization
Learned Decision Function as an Algorithm
Numerical Experiments and Discussions
Conclusion
Related Works
Literature Review
Understanding transformer and in-context learning.
Pre-trained transformer for RL.
Sequential decision-making.
$\texttt{TF}_{\hat{\theta}}$ vs. Online Decision Transformer
Problem Examples for the General Setup in Section \ref{sec:setup}
...and 36 more sections

Key Result

Proposition 3.2

Suppose $\hat{\theta}$ is determined by \eqref{eqn:erm_theta}, where $\kappa n$ data sequences are from $\mathcal{P}_{\gamma,f}$ and $(1-\kappa)n$ data sequences are from $\mathcal{P}_{\gamma,\texttt{TF}_{\tilde{\theta}}}$ for some parameter $\tilde{\theta}$. For a Lipschitz and bounded loss $l$, the following [...] where $\mathrm{Comp}(\cdot)$ denotes some complexity measure and $W_1(\cdot,\cdot)$ is the Wasserstein [...]
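The data mixture in the proposition can be sketched as follows: a fraction $\kappa$ of the $n$ training sequences has its actions generated by a fixed decision function $f$, and the remaining $(1-\kappa)n$ sequences have actions generated by the current model. This is only an illustration under stated assumptions. The Bernoulli-bandit environment, `uniform_policy` (standing in for $f$), and `empirical_greedy_policy` (standing in for $\texttt{TF}_{\tilde{\theta}}$, and not a transformer) are all hypothetical. Each step is labeled with the optimal action, reflecting the abstract's point that optimal actions are available during pre-training.

```python
import random


def sample_env(rng, k=3):
    """Draw a hypothetical environment: k Bernoulli arm means (stand-in for gamma)."""
    return [rng.random() for _ in range(k)]


def uniform_policy(history, k, rng):
    """Stand-in for the fixed decision function f: pick an arm uniformly at random."""
    return rng.randrange(k)


def empirical_greedy_policy(history, k, rng):
    """Stand-in for the current model TF: try each arm once, then pick the best empirical mean."""
    counts, sums = [0] * k, [0.0] * k
    for action, reward in history:
        counts[action] += 1
        sums[action] += reward
    untried = [a for a in range(k) if counts[a] == 0]
    if untried:
        return untried[0]
    return max(range(k), key=lambda a: sums[a] / counts[a])


def rollout(env, policy, horizon, rng):
    """Generate one sequence: the policy drives the history, labels are the optimal action."""
    best = max(range(len(env)), key=lambda a: env[a])
    history, labels = [], []
    for _ in range(horizon):
        action = policy(history, len(env), rng)
        reward = 1.0 if rng.random() < env[action] else 0.0
        history.append((action, reward))
        labels.append(best)   # supervised target: the optimal action in this environment
    return history, labels


def build_mixed_dataset(n, kappa, horizon, rng):
    """Return n sequences: round(kappa*n) driven by f, the rest by the model stand-in."""
    n_f = int(round(kappa * n))
    data = []
    for i in range(n):
        source, policy = ("f", uniform_policy) if i < n_f else ("TF", empirical_greedy_policy)
        env = sample_env(rng)
        history, labels = rollout(env, policy, horizon, rng)
        data.append((source, history, labels))
    return data


if __name__ == "__main__":
    dataset = build_mixed_dataset(n=10, kappa=0.3, horizon=5, rng=random.Random(0))
    print([source for source, _, _ in dataset])
```

In an actual training loop the model stand-in would be replaced by the transformer being trained, so the second part of the dataset changes as the parameters are updated.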

Figures (18)

Figure 1: Comparison between the pre-trained transformer framework and traditional sequential decision-making methods (such as the structured algorithms UCB and Thompson sampling). For traditional methods, the decision-making agent (or policy) interacts directly with a single environment sampled from the real environment distribution, focusing on exploration and exploitation within that specific environment. In contrast, the pre-trained transformer is trained across multiple environments sampled from a simulator distribution. During pre-training, the transformer collects trajectories and updates its parameters by interacting with diverse environments. Once pre-trained, the transformer functions as an algorithm and can be applied in the real environment just like the structured algorithms. Unlike the structured algorithms, however, the pre-trained transformer leverages the large amount of pre-training data sampled from the simulation environment.
Figure 2: (a) Training dynamics. Orange: $M_0=M=130$. Blue: $M_0=50$ and $M=130$. The plot shows the effectiveness of injecting/mixing the transformer-generated sequences into the training procedure. (b) A visualization of $H_t$ with the $a_{\tau}$'s in $H_t$ generated from various $\texttt{TF}_{\theta_m}$. For each $\texttt{TF}_{\theta_m}$, we generate 30 sequences. The decision function $\texttt{Alg}^*$ is defined in the next section. We observe that (i) the transformer-generated action sequences shift over time, so the training should adaptively focus more on the recently generated sequences, as in the design of Algorithm \ref{alg:SupPT}; (ii) the action sequences gradually get closer to those of the optimal decision function $\texttt{Alg}^*$. The experiment setups are deferred to Appendix \ref{appx:figure_detail}.
Figure 3: The pre-trained transformer $\texttt{TF}_{\hat{\theta}}$ closely matches the optimal decision function $\texttt{Alg}^*$. For both figures, we plot decision trials for $\texttt{Alg}^*$ and $\texttt{TF}_{\hat{\theta}}$. The optimal actions change over time because the $X_t$'s differ across time $t$. The experiment setup is deferred to Appendix \ref{appx:figure_detail}.
Figure 4: The average out-of-sample regret of $\texttt{TF}_{\hat{\theta}}$ against benchmark algorithms (see details in Appendix \ref{appx:benchmark}), calculated based on 100 runs. The numbers in the legend are the final regret at $t=100$. The shaded area in the plots indicates the $90\%$ (empirical) confidence interval for the regrets. The prior distribution $\mathcal{P}_{\gamma}$ is continuous (infinitely many possible $\gamma$). The problem dimensions: number of arms for MAB = 20, dimension of linear bandits = 2, dimension of $X_t$ for pricing = 6, dimension of $X_t$ for newsvendor = 4.
Figure 5: Comparison between the training frameworks of Online Decision Transformer (ODT) \cite{zheng2022online} and $\texttt{TF}_{\hat{\theta}}$. The online interactions of ODT take place in the (single) test environment to explore a (single) good policy for that same environment, whereas our proposed training framework (Algorithm \ref{alg:SupPT}) has online interactions in multiple environments to mitigate the OOD issue in the pre-training data (as discussed in Section \ref{sec:algo}), and $\texttt{TF}_{\hat{\theta}}$ can handle different test environments.
...and 13 more figures
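
The regret summaries described in the Figure 4 caption (an average over 100 runs with a 90% empirical interval) can be computed along the following lines. This is a sketch under an assumption: the caption does not say how the interval is constructed, so the code reads it as the 5th and 95th empirical percentiles of the per-run regrets. The function name `empirical_band` and the synthetic inputs are hypothetical.

```python
import math
import random


def empirical_band(values, level=0.90):
    """Return (mean, lower, upper) for a list of per-run regrets.

    lower/upper are empirical percentiles at (1-level)/2 and 1-(1-level)/2,
    computed with linear interpolation between order statistics.
    This percentile reading of "empirical confidence interval" is an assumption.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one run")

    def quantile(p):
        pos = p * (n - 1)                 # fractional index into the sorted list
        i = int(math.floor(pos))
        j = min(i + 1, n - 1)
        frac = pos - i
        return xs[i] * (1.0 - frac) + xs[j] * frac

    tail = (1.0 - level) / 2.0
    return sum(xs) / n, quantile(tail), quantile(1.0 - tail)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for final regrets from 100 independent runs.
    final_regrets = [abs(rng.gauss(5.0, 1.5)) for _ in range(100)]
    mean, lower, upper = empirical_band(final_regrets)
    print(f"mean={mean:.2f}, 90% band=({lower:.2f}, {upper:.2f})")
```

To shade a full regret curve, the same function would be applied at each time step $t$ to the regrets of all runs at that step.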

Theorems & Definitions (12)

Claim 3.1
Proposition 3.2
Proposition 4.1
Example 4.2
Proposition 4.3
Proposition 4.4
Proposition B.2: Surrogate property
proof
proof
proof
...and 2 more
