Token Compensator: Altering Inference Cost of Vision Transformer without Re-Tuning

Shibo Jie, Yehui Tang, Jianyuan Guo, Zhi-Hong Deng, Kai Han, Yunhe Wang

TL;DR

This work tackles two problems in Vision Transformers: the high training and inference cost caused by token redundancy, and the fragility of token-compression methods when the compression degrees used in training and inference are mismatched. It introduces ToCom, a lightweight, plug-and-play Token Compensator built from parameter-efficient LoRA modules that are trained via fast self-distillation on pre-training data and that bridge the gap between a model trained at one token-compression degree and deployed at another. By decoupling the training and inference degrees, ToCom improves off-the-shelf downstream models without extra training, and it generalizes across datasets, backbones, and token-compression methods. Experiments on over 20 downstream tasks show consistent improvements, with maximum gains in the average performance of DeiT-B of 2.3% on CIFAR100, 1.5% on FGVC, and 2.0% on VTAB-1k, alongside faster training and inference. The work demonstrates the practical value of a universal compensator for token compression and broadens the applicability of ViT acceleration techniques in real-world deployments.

Abstract

Token compression expedites the training and inference of Vision Transformers (ViTs) by reducing the number of redundant tokens, e.g., pruning inattentive tokens or merging similar tokens. However, when applied to downstream tasks, these approaches suffer a significant performance drop when the compression degrees are mismatched between the training and inference stages, which limits the application of token compression to off-the-shelf trained models. In this paper, we propose a model arithmetic framework to decouple the compression degrees of the two stages. In advance, we perform an additional fast parameter-efficient self-distillation stage on the pre-trained models to obtain a small plugin, called Token Compensator (ToCom), which describes the gap between models across different compression degrees. During inference, ToCom can be directly inserted into any off-the-shelf downstream model with any mismatched training and inference compression degrees to obtain universal performance improvements without further training. Experiments on over 20 downstream tasks demonstrate the effectiveness of our framework. On CIFAR100, fine-grained visual classification, and VTAB-1k, ToCom yields maximum improvements of 2.3%, 1.5%, and 2.0%, respectively, in the average performance of DeiT-B. Code: https://github.com/JieShibo/ToCom
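
The "model arithmetic" idea in the abstract can be illustrated with a minimal sketch. It assumes that each LoRA group stores the gap between two adjacent compression levels, that gaps across several levels are obtained by summing adjacent groups, and that the reverse direction simply flips the sign. These composition rules, and every name below (`lora_delta`, `compensate`, `gaps`), are illustrative assumptions rather than the released ToCom code.

```python
# Hypothetical sketch: names, shapes, and the "sum adjacent gaps" rule are
# illustrative assumptions, not the released ToCom implementation.

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, sign=1.0):
    """Element-wise a + sign * b for two equally shaped matrices."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_delta(B, A):
    """Low-rank update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return matmul(B, A)

def compensate(weight, gaps, source, target):
    """Shift `weight` from compression level `source` to level `target`.

    gaps[i] holds the LoRA factors (B, A) assumed to describe the gap
    between adjacent levels i and i + 1 (an assumption of this sketch).
    """
    lo, hi = sorted((source, target))
    sign = 1.0 if target > source else -1.0
    out = [row[:] for row in weight]  # copy so the input stays untouched
    for i in range(lo, hi):
        out = add(out, lora_delta(*gaps[i]), sign)
    return out

# Toy usage: a 2x2 weight and two rank-1 gaps between levels 0->1 and 1->2.
W = [[1, 0], [0, 1]]
gaps = [
    ([[1], [0]], [[1, 0]]),  # gap between level 0 and level 1
    ([[0], [1]], [[0, 2]]),  # gap between level 1 and level 2
]
W_target = compensate(W, gaps, source=0, target=2)
```

Because the update is purely additive, the same groups can be reused for any pair of source and target levels, which is what lets one plugin serve many mismatched training and inference settings in this sketch.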

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, 14 tables, 1 algorithm.

Figures (11)

  • Figure 1: Left: Previous token compression methods focus on the scenario where training and inference compression degrees are consistent (① and ④), but do not adequately address the performance of models when these degrees differ (② and ③). Right: The performance of token compression degrades significantly when the compression degrees in training and inference are not equal. After applying our ToCom without training, the performance is recovered.
  • Figure 1: Results of model gap transfer. We use CIFAR100 as $\mathcal{D}_A$ and FGVC tasks as $\mathcal{D}_B$. The results are evaluated with $r=16$.
  • Figure 2: Performance of ToMe on CIFAR100 and FGVC datasets. We use DeiT-B as the pre-trained backbone. We report performance when the source $r \in \{0, 16, \text{target } r\}$.
  • Figure 3: Training and inference throughput of DeiT-B with different $r$ of ToMe. The batch size is 128 for training and 256 for inference.
  • Figure 4: Illustration of our ToCom. ToCom consists of multiple groups of LoRA modules, which are trained with parameter-efficient self-distillation on the pre-training dataset. The teacher and student models have different token compression degrees, which are sampled at each step, and ToCom is plugged into the student model during training.
  • ...and 6 more figures