TOAST: Transformer Optimization using Adaptive and Simple Transformations

Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

TL;DR

TOAST tackles the efficiency bottleneck of large vision transformers by exploiting intra-model block redundancies to approximate entire transformer blocks with simple, training-free mappings. By aligning block outputs via a closed-form translator (often linear or identity), it bypasses intermediate layers and reduces parameters and FLOPs while maintaining or even improving downstream accuracy across models (e.g., ViT-L, DiNO-B, DEiT-S) and datasets from MNIST to ImageNet1k. The approach is validated through latent-space analyses using block-wise similarity metrics and extensive image-classification experiments, showing that a small sample (about 500) suffices to estimate the translators and that late-block approximations are especially effective. These findings open a practical path for deploying scalable foundation models on edge devices and other resource-constrained environments without additional training.

Abstract

Foundation models achieve State-of-the-Art (SOTA) performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or fine-tuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited to enable techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source of efficiency gains. In this paper, we introduce TOAST (Transformer Optimization using Adaptive and Simple Transformations), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as a linear transformation or even the identity, without any additional training. Across SOTA pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.
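
To make the idea of a closed-form linear mapping concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the outputs of a start block and an end block have already been collected as 2-D arrays of shape (samples, features), for example one vector per image, and it fits the mapping by ordinary least squares. The function names `fit_linear_translator` and `translate`, the feature dimension of 64, and the synthetic data are all hypothetical; only the sample count of about 500 comes from the text.

```python
import numpy as np


def fit_linear_translator(x_s: np.ndarray, x_e: np.ndarray) -> np.ndarray:
    """Least-squares fit of W such that x_s @ W approximates x_e.

    x_s, x_e: (n_samples, dim) arrays holding the outputs of the start
    and end blocks for the same inputs (an assumed layout).
    """
    # lstsq solves min_W ||x_s @ W - x_e||_F in closed form; no training loop.
    w, *_ = np.linalg.lstsq(x_s, x_e, rcond=None)
    return w


def translate(x_s: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply the fitted mapping in place of the skipped blocks."""
    return x_s @ w


# Synthetic stand-in for real activations: 500 samples, 64-dim features.
rng = np.random.default_rng(0)
x_s = rng.normal(size=(500, 64))
true_map = rng.normal(size=(64, 64))
x_e = x_s @ true_map + 0.01 * rng.normal(size=(500, 64))

w = fit_linear_translator(x_s, x_e)
rel_err = np.linalg.norm(translate(x_s, w) - x_e) / np.linalg.norm(x_e)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In a real model, `x_s` and `x_e` would come from forward passes over a small subset of training data, and the fitted `w` would replace every block between the two. The identity mapping mentioned in the abstract corresponds to skipping the fit and using `x_s` directly.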

Paper Structure

This paper contains 36 sections, 3 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Framework Description. Given two latent spaces $\mathbf{X}^{(s)}$ and $\mathbf{X}^{(e)}$ corresponding to the outputs of blocks $s$ and $e$ for a random subset of 500 training samples, TOAST estimates a lightweight transformation $\mathcal{T}$ such that $\mathbf{X}^{(e)} \approx \mathcal{T}(\mathbf{X}^{(s)})$. This allows entire transformer blocks to be approximated by simple closed-form mappings (e.g., linear or identity), reducing parameters and computation without retraining.
  • Figure 2: Block Similarities. Block-by-block similarities in DiNO-B and DEiT-S models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100 and ImageNet1k. Each matrix quantifies the similarity between latent representations of different blocks, highlighting candidate blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specific dataset. Additional results are in the appendix section on similarities.
  • Figure 3: Approximation vs. Representation Similarity. Similarity between the last-block representations of the original and the approximated model when approximating the $i^{\text{th}}$ block.
  • Figure 5: Sample Size Ablation. Classification accuracy as a function of the number of training samples used for approximating different layers of DiNO-B and DEiT-S with a linear transformation using ImageNet1k. Accuracy stabilizes after approximately 500 samples.
  • Figure 6: Block Similarities. Block-by-block similarities in ViT-T, ViT-S, DiNO-S and ViT-B models across five datasets: MNIST, F-MNIST, CIFAR-10, CIFAR-100 and ImageNet1k. Each matrix quantifies the similarity between latent representations of different blocks, highlighting candidate blocks for approximation. The matrices reveal that the similarity between blocks is predominantly influenced by the model rather than the specific dataset.
  • ...and 9 more figures
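
The block-by-block matrices described in the Figure 2 and Figure 6 captions can be illustrated with a small sketch. The captions as extracted do not name the similarity metric, so the mean per-sample cosine similarity used here is an assumption chosen for illustration, not necessarily the paper's metric. The function name `block_similarity_matrix`, the array layout of (samples, features) per block, and the synthetic data are likewise hypothetical.

```python
import numpy as np


def block_similarity_matrix(block_outputs: list[np.ndarray]) -> np.ndarray:
    """Mean per-sample cosine similarity between every pair of blocks.

    block_outputs[i]: (n_samples, dim) representations after block i.
    Cosine similarity is an assumed metric, chosen for illustration.
    """
    # Normalize each sample's vector to unit length, once per block.
    normed = [x / np.linalg.norm(x, axis=1, keepdims=True) for x in block_outputs]
    n_blocks = len(normed)
    sim = np.empty((n_blocks, n_blocks))
    for i in range(n_blocks):
        for j in range(n_blocks):
            # Row-wise dot product of unit vectors is the per-sample cosine.
            sim[i, j] = np.mean(np.sum(normed[i] * normed[j], axis=1))
    return sim


# Synthetic stand-in: each "block" adds a small perturbation to the previous one.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 32))
outputs = []
for _ in range(6):
    x = x + 0.3 * rng.normal(size=x.shape)
    outputs.append(x)

sim = block_similarity_matrix(outputs)
print(np.round(sim, 2))
```

Entries near 1 away from the diagonal would mark pairs of blocks whose representations are close, which is the kind of pattern the captions describe as indicating candidates for approximation.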