Training Noise Token Pruning

Mingxing Rao, Bohan Jiang, Daniel Moyer

TL;DR

Training Noise Token Pruning (TNT) tackles the efficiency challenge of vision transformers by turning discrete token dropping into a continuous noise-allocation problem during training, then applying discrete pruning at inference. Grounded in Information Bottleneck and Rate-Distortion theory, TNT learns per-token relevance via a Noise Allocator that injects Gaussian noise with amplitude tied to token importance; at test time the most relevant tokens are kept, reducing computation with minimal accuracy loss. Empirical results on ImageNet using ViT and DeiT backbones show state-of-the-art accuracy-per-GFLOP trade-offs, particularly strong in low-token regimes, with robust gains across single- and multi-layer pruning configurations. The approach also includes a redundancy-removal step that randomly partitions tokens and prunes the most similar pairs, improving efficiency without CLS-token dependence. Overall, TNT offers a practical, theoretically motivated framework for deploying efficient vision transformers in resource-constrained settings, with openly available code for reproducibility.
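The redundancy-removal step mentioned above (randomly partition the tokens, then prune the most similar pairs) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the helper names `cosine` and `prune_redundant`, the use of cosine similarity, the even two-way split, and the choice to drop tokens from the first group are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def prune_redundant(tokens, num_drop, rng=None):
    """Drop up to `num_drop` tokens that closely duplicate another token.

    Assumed procedure: shuffle the token indices, split them into two
    groups, score each token in the first group by its best match in the
    second group, and remove the highest-scoring (most redundant) ones.
    """
    rng = rng or random.Random()
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    half = len(idx) // 2
    group_a, group_b = idx[:half], idx[half:]

    # Best similarity of each group-A token to any group-B token.
    best = {
        i: max(cosine(tokens[i], tokens[j]) for j in group_b)
        for i in group_a
    }
    ranked = sorted(group_a, key=lambda i: best[i], reverse=True)
    drop = set(ranked[:num_drop])

    # Surviving tokens keep their original order.
    return [tok for i, tok in enumerate(tokens) if i not in drop]
```

Because the scores compare tokens only with each other, a step of this kind needs no CLS token, which is consistent with the CLS-independence noted above.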

Abstract

In this work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token-dropping condition to continuous additive noise, providing smooth optimization during training while retaining the computational gains of discrete dropping at deployment. We provide theoretical connections to the Rate-Distortion literature, along with empirical evaluations on the ImageNet dataset using ViT and DeiT architectures that demonstrate TNT's advantages over previous pruning methods.

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 20 tables.

Figures (8)

  • Figure 1: Training Noise Token Pruning (TNT). Our proposed method computes a relevance term $\alpha_i$ for each token. In training (diagrammed at top), these terms dictate an amount of noise added to the token, while at test time they indicate pruning order.
  • Figure 2: Noise Allocator block architecture: the block diagrammed above is injected into pre-trained models as a pruning layer. It takes the output of the previous Transformer block as input, then computes the noise signal terms $\alpha$ using a linear layer followed by a Softmax function. During training it samples Gaussian noise conditioned on $\alpha$ for each token, then adds the noise to the token embeddings. At test time, tokens are instead dropped. This pruning method can be trained with all parameters outside the Noise Allocator frozen.
  • Figure 3: Visualization of token pruning maps on ImageNet-1K: at left are the original images, and each column progressing right shows a single-layer pruning and its associated kept/dropped tokens, for layers 1-5 of the DeiT-B-Distil model.
  • Figure 4: Single-layer pruning results: We plot the Top-1 Accuracy on the ImageNet-1K validation set for each of the pruning methods as a function of computational efficiency, measured by GFLOPs in the top row and by throughput in the bottom row, for single-layer pruning. The base model is DeiT-B-Distil in the first column, DeiT-S-Distil in the second column, and ViT/16 in the third column. Note that the mean-pooled token embedding ViT in the third column has no CLS token, and thus EViT and Top-K cannot be applied to it.
  • Figure 5: Multi-layer pruning results: We plot the Top-1 Accuracy on the ImageNet-1K validation set for each of the pruning methods as a function of computational efficiency, measured by GFLOPs in the top row and by throughput in the bottom row, for multi-layer pruning. The base model is DeiT-B-Distil in the first column, DeiT-S-Distil in the second column, and ViT/16 in the third column. Note that the mean-pooled token embedding ViT in the third column has no CLS token, and thus EViT and Top-K cannot be applied to it.
  • ...and 3 more figures
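
The Noise Allocator behaviour described in the captions of Figures 1 and 2 (a linear layer and Softmax produce a relevance term $\alpha_i$ per token; training adds Gaussian noise conditioned on $\alpha$; test time drops tokens in relevance order) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `noise_scale` parameter, and in particular the rule that the noise standard deviation equals `noise_scale * (1 - alpha_i)` are hypothetical, since the captions do not specify how the noise is conditioned on $\alpha$.

```python
import math
import random


def relevance_scores(tokens, weight, bias=0.0):
    """Linear layer (one logit per token) followed by a softmax over tokens."""
    logits = [sum(w * x for w, x in zip(weight, tok)) + bias for tok in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noise_allocator(tokens, weight, training, keep=None,
                    noise_scale=1.0, rng=None):
    """Add relevance-dependent noise in training; drop tokens at test time."""
    alpha = relevance_scores(tokens, weight)

    if training:
        rng = rng or random.Random()
        # Assumption: less relevant tokens receive noise with a larger
        # standard deviation, noise_scale * (1 - alpha_i).
        return [
            [x + rng.gauss(0.0, noise_scale * (1.0 - a)) for x in tok]
            for tok, a in zip(tokens, alpha)
        ]

    # Test time: keep the `keep` most relevant tokens (all if keep is None),
    # preserving their original order in the sequence.
    ranked = sorted(range(len(tokens)), key=lambda i: alpha[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]
```

In this sketch the only learnable quantities are `weight` and `bias`, which mirrors the caption's remark that the method can be trained with all parameters outside the Noise Allocator frozen.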