In-Context Learning with Vision Transformers: Case Study
Antony Zhao, Alex Proshkin, Fergal Hennessy, Francesco Crivelli
TL;DR
The paper investigates in-context learning (ICL) for vision by extending ICL results on simple function classes to the image space. It uses a decoder-only transformer that processes image embeddings from CNNs or ViTs together with context pairs $(x_i, f(x_i))$ to predict $f(x)$ for a new input $x$. Training on downscaled CIFAR-10 images with a curriculum that gradually increases input dimensionality and prompt size, the authors study four function classes, including a linear map $f(x)=w^T x$ and nonlinear targets such as randomized CNN and ViT mappings. The results show that the in-context learner can approach the pseudo-inverse solution for linear targets and, for nonlinear targets, can match or exceed models trained from scratch, especially in low-context regimes; gradient-descent baselines serve as reference points. These findings suggest that decoder-only ICL architectures can achieve data-efficient, few-shot learning on simple image mappings, and they provide a foundation for scaling to more complex vision tasks with larger backbones and richer curricula.
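To make the linear setting concrete, the following is a minimal sketch of how a prompt of context pairs $(x_i, f(x_i))$ for $f(x)=w^T x$ could be sampled, and of the pseudo-inverse reference that the in-context learner is compared against. It is not the authors' code: the function names (`make_linear_prompt`, `pinv_predict`, `mean_squared_error`) are hypothetical, Gaussian vectors stand in for flattened downscaled images, and the targets are assumed to be noiseless.

```python
import numpy as np


def make_linear_prompt(n_context, dim, rng):
    """Sample one task f(x) = w^T x, its context pairs, and a query point."""
    w = rng.standard_normal(dim)                # hidden task weights
    xs = rng.standard_normal((n_context, dim))  # stand-in for flattened images
    ys = xs @ w                                 # context targets f(x_i)
    x_query = rng.standard_normal(dim)
    return xs, ys, x_query, x_query @ w


def pinv_predict(xs, ys, x_query):
    """Minimum-norm least-squares estimate of w from the context, applied to the query."""
    w_hat = np.linalg.pinv(xs) @ ys
    return x_query @ w_hat


def mean_squared_error(n_context, dim, trials=200, seed=0):
    """Average squared query error of the pseudo-inverse reference over random tasks."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(trials):
        xs, ys, x_query, y_query = make_linear_prompt(n_context, dim, rng)
        errors.append((pinv_predict(xs, ys, x_query) - y_query) ** 2)
    return float(np.mean(errors))


if __name__ == "__main__":
    dim = 8
    for n_context in (4, 8, 16):
        print(n_context, mean_squared_error(n_context, dim))
```

In this noiseless sketch the reference recovers $w$ essentially exactly once the number of context pairs reaches the input dimension, while with fewer pairs it can only recover the component of $w$ spanned by the context inputs. That is the low-context regime the summary highlights.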
Abstract
Large transformer models have been shown to be capable of in-context learning: given a prompt containing examples and a query, they can perform few-shot, one-shot, or zero-shot tasks and output the answer corresponding to the query. Of particular interest to us, transformers have been shown to learn general classes of functions in context, such as linear functions and small 2-layer neural networks, on random data (Garg et al., 2023). We aim to extend this result to the image space and analyze how well these models can in-context learn more complex functions there, such as convolutional neural networks and other methods.
