In-Context Deep Learning via Transformer Models
Weimin Wu, Maojiang Su, Jerry Yao-Chieh Hu, Zhao Song, Han Liu
TL;DR
This work asks whether a transformer can simulate the training of a deep neural network via in-context learning (ICL). By constructing explicit ReLU- and Softmax-based transformer architectures, it shows how a transformer can perform multiple gradient-descent (GD) steps on an $N$-layer NN in-context, with rigorous approximation and convergence guarantees. The key contributions are a $(2N+4)L$-layer ReLU transformer that emulates $L$ GD steps, with an extension to varying input/output dimensions, and a $4L$-layer Softmax transformer that offers similar capabilities, backed by a universal-approximation result. Both are supported by detailed gradient-decomposition analyses and error bounds. Empirical results on synthetic data show that ICL can match direct training performance for 3-, 4-, and 6-layer networks, highlighting the potential of in-context learning in foundation models to perform deep learning tasks without explicit parameter updates.
Abstract
We investigate the transformer's capability to simulate the training process of deep models via in-context learning (ICL), i.e., in-context deep learning. Our key contribution is providing a positive example of using a transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a $(2N+4)L$-layer transformer capable of simulating $L$ gradient descent steps of an $N$-layer ReLU network through ICL. We also give theoretical guarantees for approximation within any given error and for the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting of Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.
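To make the simulated procedure concrete, the following is a minimal sketch of the target computation: $L$ explicit full-batch gradient-descent steps on an $N$-layer ReLU network, together with the depth $(2N+4)L$ of the ReLU transformer stated above. It does not reproduce the transformer construction itself. The squared-error loss, the linear final layer, the toy data, and all function names (`forward`, `gd_step`, `relu_transformer_depth`, `demo`) are assumptions chosen for illustration, not details taken from the paper.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(weights, x):
    """Return the activations of every layer; ReLU on all but the last layer."""
    acts = [x]
    for i, W in enumerate(weights):
        z = matvec(W, acts[-1])
        acts.append(z if i == len(weights) - 1 else relu(z))
    return acts

def loss(weights, data):
    """Mean of 0.5 * squared error over (x, y) pairs (assumed loss)."""
    total = 0.0
    for x, y in data:
        pred = forward(weights, x)[-1]
        total += 0.5 * sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(data)

def gd_step(weights, data, lr):
    """One full-batch gradient-descent step via manual backpropagation."""
    grads = [[[0.0] * len(W[0]) for _ in W] for W in weights]
    for x, y in data:
        acts = forward(weights, x)
        # Gradient of the loss w.r.t. the (linear) output layer.
        delta = [p - t for p, t in zip(acts[-1], y)]
        for i in reversed(range(len(weights))):
            for r, d in enumerate(delta):
                for c, a in enumerate(acts[i]):
                    grads[i][r][c] += d * a / len(data)
            if i > 0:
                # Backpropagate through W_i, then through the preceding ReLU.
                back = [sum(weights[i][r][c] * delta[r] for r in range(len(delta)))
                        for c in range(len(acts[i]))]
                delta = [b if a > 0 else 0.0 for b, a in zip(back, acts[i])]
    return [[[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, G)]
            for W, G in zip(weights, grads)]

def relu_transformer_depth(n_layers, gd_steps):
    """Depth (2N+4)L of the ReLU transformer, as stated in the abstract."""
    return (2 * n_layers + 4) * gd_steps

def demo(n_layers=3, gd_steps=20, width=4, lr=0.05, seed=0):
    """Run L explicit GD steps on a toy regression task (hypothetical data)."""
    rng = random.Random(seed)
    dims = [2] + [width] * (n_layers - 1) + [1]
    weights = [[[rng.uniform(-0.5, 0.5) for _ in range(dims[i])]
                for _ in range(dims[i + 1])] for i in range(n_layers)]
    data = []
    for _ in range(8):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(([a, b], [a + b]))
    before = loss(weights, data)
    for _ in range(gd_steps):
        weights = gd_step(weights, data, lr)
    return before, loss(weights, data)

if __name__ == "__main__":
    before, after = demo()
    print("transformer depth:", relu_transformer_depth(3, 20))
    print("loss before:", before, "loss after:", after)
```

In this sketch each call to `gd_step` corresponds to one of the $L$ steps that the construction emulates in-context with a block of $2N+4$ transformer layers, so the required depth grows linearly in both the network depth $N$ and the number of steps $L$.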
