Training Noise Token Pruning
Mingxing Rao, Bohan Jiang, Daniel Moyer
TL;DR
Training Noise Token (TNT) Pruning addresses the inference cost of vision transformers by relaxing discrete token dropping into a continuous noise-allocation problem during training, then applying discrete pruning at inference. Grounded in Information Bottleneck and Rate-Distortion theory, TNT learns per-token relevance with a Noise Allocator that injects Gaussian noise whose amplitude is tied to token importance; at test time only the most relevant tokens are kept, reducing computation with minimal accuracy loss. Experiments on ImageNet with ViT and DeiT backbones show state-of-the-art accuracy-per-GFLOP trade-offs in both single-layer and multi-layer pruning configurations, and the results are particularly strong in low-token regimes. The approach also includes a redundancy-removal step that randomly partitions tokens and prunes the most similar pairs, improving efficiency without depending on the CLS token. Overall, TNT offers a practical, theoretically motivated framework for deploying efficient vision transformers in resource-constrained settings, and the code is openly available for reproducibility.
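The redundancy-removal step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `prune_redundant`, the use of cosine similarity, the even split into two groups, and the rule of dropping tokens from only one group are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def prune_redundant(tokens, r, rng=random):
    """Return the indices of tokens kept after dropping up to r redundant ones.

    Hypothetical procedure (an assumption, not the paper's exact rule):
    1. randomly split the token indices into two groups A and B;
    2. score each token in A by its highest similarity to any token in B;
    3. drop the r tokens in A with the highest scores.
    """
    indices = list(range(len(tokens)))
    rng.shuffle(indices)
    half = len(indices) // 2
    group_a, group_b = indices[:half], indices[half:]

    # Similarity of each A token to its closest match in B.
    best_match = {
        i: max(cosine(tokens[i], tokens[j]) for j in group_b) for i in group_a
    }

    # The most similar A tokens are treated as redundant and removed.
    ranked = sorted(group_a, key=lambda i: best_match[i], reverse=True)
    dropped = set(ranked[:r])
    return [i for i in range(len(tokens)) if i not in dropped]
```

Because the partition is random and no class-token attention is consulted, a step of this shape does not depend on the CLS token. Passing a seeded `random.Random` instance as `rng` makes the partition reproducible.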
Abstract
We present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token-dropping condition to continuous additive noise, providing smooth optimization during training while retaining the computational gains of discrete dropping at deployment. We provide theoretical connections to the Rate-Distortion literature, along with empirical evaluations on the ImageNet dataset using ViT and DeiT architectures that demonstrate TNT's advantages over previous pruning methods.
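The split between training and deployment described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `add_training_noise` and `prune_top_k`, the scale parameter `beta`, and the specific rule that the noise standard deviation is `beta * (1 - alpha_i)` for softmax-normalized relevance `alpha_i` are all hypothetical. In the actual method the relevance scores would come from a learned Noise Allocator; here they are plain inputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_training_noise(tokens, scores, beta=1.0, rng=random):
    """Training-time relaxation: perturb every token instead of dropping any.

    Assumed rule (illustrative only): tokens with lower normalized relevance
    receive Gaussian noise with a larger standard deviation.
    """
    alpha = softmax(scores)
    noisy = []
    for token, a in zip(tokens, alpha):
        sigma = beta * (1.0 - a)  # assumption: low relevance -> more noise
        noisy.append([x + rng.gauss(0.0, sigma) for x in token])
    return noisy


def prune_top_k(tokens, scores, k):
    """Inference-time pruning: keep the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep], keep
```

During training every token still flows through the network, so gradients reach all relevance scores. At inference the same scores drive a hard top-k selection, which is where the compute savings come from.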
