TOAST: Transformer Optimization using Adaptive and Simple Transformations
Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt
TL;DR
TOAST tackles the efficiency bottleneck of large vision transformers by exploiting intra-model block redundancies to approximate entire transformer blocks with simple, training-free mappings. By aligning block outputs via a closed-form translator (often linear or the identity), it bypasses intermediate layers and reduces parameters and FLOPs while maintaining or even improving downstream accuracy across models (e.g., ViT-L, DINOv2-B, DeiT-S) and datasets from MNIST to ImageNet-1k. The approach is validated through latent-space analyses using block-wise similarity metrics and extensive image-classification experiments, showing that a small sample (about 500) suffices to estimate the translators and that late-block approximations are especially effective. These findings open a practical path for deploying scalable foundation models on edge devices and in other resource-constrained environments without additional training.
Abstract
Foundation models achieve state-of-the-art (SOTA) performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or fine-tuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited to enable techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source of efficiency gains. In this paper, we introduce TOAST (Transformer Optimization using Adaptive and Simple Transformations), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as a linear transformation or even the identity, without any additional training. Across SOTA pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.
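To make the idea of a closed-form translator concrete, the following is a minimal sketch, not the authors' implementation. It assumes the translator is an affine map fitted by ordinary least squares between the input of the first skipped block and the output of the last skipped block; the exact estimator used in the paper may differ. The function names (fit_linear_translator, apply_translator) are hypothetical, and synthetic arrays stand in for real transformer activations.

```python
import numpy as np


def fit_linear_translator(x_in, x_out):
    """Fit an affine map x_out ~ x_in @ weight + bias by least squares.

    x_in:  (n, d) representations entering the blocks to be skipped.
    x_out: (n, d) representations leaving those blocks.
    Assumption: ordinary least squares; this is an illustrative choice.
    """
    ones = np.ones((x_in.shape[0], 1))
    design = np.hstack([x_in, ones])  # append a bias column
    solution, *_ = np.linalg.lstsq(design, x_out, rcond=None)
    return solution[:-1], solution[-1]  # weight (d, d), bias (d,)


def apply_translator(x, weight, bias):
    """Replace the skipped blocks with a single affine map."""
    return x @ weight + bias


# Synthetic stand-in for activations: about 500 samples, as in the TL;DR.
rng = np.random.default_rng(0)
n_samples, dim = 500, 16
x_in = rng.normal(size=(n_samples, dim))
true_w = rng.normal(size=(dim, dim))
true_b = rng.normal(size=dim)
x_out = x_in @ true_w + true_b + 0.01 * rng.normal(size=(n_samples, dim))

weight, bias = fit_linear_translator(x_in, x_out)

# Check the fitted map on held-out synthetic inputs.
x_test = rng.normal(size=(10, dim))
target = x_test @ true_w + true_b
approx = apply_translator(x_test, weight, bias)
rel_err = np.linalg.norm(approx - target) / np.linalg.norm(target)
print(f"relative error on held-out samples: {rel_err:.2e}")
```

In a real model, x_in and x_out would be collected by running a small calibration set through the pretrained network and recording the token representations at the two block boundaries. The identity translator mentioned above corresponds to skipping the fit entirely and passing x_in through unchanged.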
