A General and Efficient Training for Transformer via Token Expansion

Wenxuan Huang; Yunhang Shen; Jiao Xie; Baochang Zhang; Gaoqi He; Ke Li; Xing Sun; Shaohui Lin

TL;DR

Vision Transformers incur high training costs due to quadratic self-attention, motivating a universal acceleration approach. The authors propose Token Expansion (ToE), a token-growth mechanism with an initialization-expansion-merging pipeline that preserves intermediate feature distributions and keeps training hyper-parameters and architecture unchanged. ToE introduces a widest feature-distribution expansion and a feature-distribution merging step to retain information while gradually reaching full-token training, and it is compatible with EfficientTrain and other training frameworks. Empirically, ToE yields about 1.3× faster training (lossless or with accuracy gains) on DeiT and LV-ViT, with positive transfer and fine-tuning results, demonstrating practical impact for broad transformer use.

Abstract

The remarkable performance of Vision Transformers (ViTs) typically comes at an extremely large training cost. Existing methods attempt to accelerate ViT training, yet they typically sacrifice universality and suffer accuracy drops. They also break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme, Token Expansion (termed ToE), to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of the original transformers, preventing the loss of crucial learnable information during training. ToE can be seamlessly integrated into the training and fine-tuning of transformers (e.g., DeiT and LV-ViT), and it is also effective for efficient training frameworks (e.g., EfficientTrain), without altering the original training hyper-parameters or architecture and without introducing additional training strategies. Extensive experiments demonstrate that ToE accelerates ViT training by about 1.3× in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion .
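To make the motivation concrete, the sketch below counts how many token-pair interactions a single self-attention layer computes when only a fraction of the tokens is kept. It is an illustration, not the authors' cost model: the token count of 196 (a 224×224 image split into 16×16 patches, class token ignored), the rounding rule, and the helper names are all assumptions, and the kept rates are example values.

```python
def kept_tokens(num_tokens: int, kept_rate: float) -> int:
    """Number of tokens kept at a given kept rate (assumed: round to nearest)."""
    return max(1, round(num_tokens * kept_rate))


def relative_attention_cost(num_tokens: int, kept_rate: float) -> float:
    """Ratio of token-pair interactions (N^2) with reduced vs. full tokens."""
    n = kept_tokens(num_tokens, kept_rate)
    return (n * n) / (num_tokens * num_tokens)


# Assumed setting: 14 x 14 = 196 patch tokens, class token ignored.
NUM_TOKENS = 196

for rate in (1 / 3, 2 / 3, 1.0):
    n = kept_tokens(NUM_TOKENS, rate)
    cost = relative_attention_cost(NUM_TOKENS, rate)
    print(f"kept rate {rate:.2f}: {n} tokens, relative attention cost {cost:.2f}")
```

This counts only the quadratic attention term. The linear layers scale with the token count rather than its square, and the kept rate grows toward 1 over training, so the end-to-end speedup is far smaller than the per-layer ratio suggests; the reported figure of about 1.3× is an empirical result and is not derived from this calculation.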

Paper Structure (28 sections, 7 equations, 7 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Training Acceleration for Transformers
Training Acceleration for CNNs
Transformer Pruning
Method
Preliminaries and Notations
Overview of ToE
Token Expansion
Spatial-distribution Token Initialization
Widest Feature-distribution Token Expansion
Feature-distribution Token Merging
Optimization of ToE
Experiments
Experimental Settings
...and 13 more sections

Figures (7)

Figure 1: The "initialization-expansion-merging" pipeline of the proposed ToE. We take the first training stage ($\delta=1$), with kept rate $r_1=2r_0=\frac{2}{3}$ and repetition step $k=1$, as an example. ToE is added only after the first Transformer block to guide token selection and usage. During training, steps (1), (2), and (3) are performed at each iteration to reduce the number of tokens. First, seed tokens are selected in step (1) for token initialization. Then, the number of selected tokens is expanded in step (2) for token expansion. Finally, in step (3), the unselected tokens (blue boxes) are merged into the selected tokens (red boxes) that have close feature distributions. During testing, ToE can be safely removed, yielding the same architecture as the original full-token Transformer.
Figure 2: Visualization of the feature distribution of the token set. We use t-SNE (van der Maaten and Hinton, 2008) to visualize the feature distributions of the output tokens of the first block, the tokens selected by ToE, and the output tokens of the second block. The baseline is DeiT-small trained on ImageNet-1K. ToE preserves the distribution integrity of the intermediate features of the original token set across different Transformer blocks while keeping the feature distributions as wide as possible.
Figure 3: Validation Top-1 accuracy of DeiT-tiny and LV-ViT-T on ImageNet-1K during training with different methods.
Figure 4: Trade-off between acceleration ratio and model performance by setting different $r_1$.
Figure 5: Details of applying ToE to DeiT and LV-ViT during training. Dotted cubes denote tokens that are all-zero vectors.
...and 2 more figures
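The three steps described in the Figure 1 caption can be sketched on a toy token set as follows. This is a rough illustration under stated assumptions, not the authors' implementation: evenly spaced seed selection, cosine distance, greedy farthest-point expansion, and averaging as the merge rule are all stand-ins chosen for simplicity, the repetition step $k$ is ignored, and every function name is hypothetical. The toy sizes mirror the caption's rates: 6 tokens, 2 seeds ($r_0=\frac{1}{3}$), and 4 kept tokens ($r_1=\frac{2}{3}$).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; an assumed stand-in for the paper's distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def initialize(num_tokens, num_seeds):
    """Step (1): pick seed indices evenly spaced over positions (assumed rule)."""
    stride = num_tokens / num_seeds
    return [int(i * stride) for i in range(num_seeds)]


def expand(tokens, seeds, num_keep):
    """Step (2): greedily add the token farthest from the selected set."""
    selected = list(seeds)
    while len(selected) < num_keep:
        remaining = [i for i in range(len(tokens)) if i not in selected]
        farthest = max(
            remaining,
            key=lambda i: min(cosine_distance(tokens[i], tokens[j]) for j in selected),
        )
        selected.append(farthest)
    return selected


def merge(tokens, selected):
    """Step (3): average each unselected token into its closest selected token."""
    groups = {j: [tokens[j]] for j in selected}
    for i, token in enumerate(tokens):
        if i not in groups:
            nearest = min(selected, key=lambda j: cosine_distance(token, tokens[j]))
            groups[nearest].append(token)
    return [[sum(col) / len(groups[j]) for col in zip(*groups[j])] for j in selected]


# Toy example: 6 two-dimensional tokens, 2 seeds, 4 kept tokens.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [0.0, -1.0]]
seeds = initialize(len(tokens), 2)
selected = expand(tokens, seeds, 4)
merged = merge(tokens, selected)
print(selected)
print(merged)
```

In this toy run the two near-duplicate tokens are left unselected and folded into their closest selected neighbours, so the reduced set still spans all four directions present in the original set. That is the intuition behind keeping the feature distribution wide while not discarding the information carried by unselected tokens.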
