Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy

Seyedarmin Azizi, Mahdi Nazemi, Massoud Pedram

TL;DR

This work tackles the memory bottleneck of Vision Transformers by introducing activation-aware mixed-rank compression, which approximates each layer's weight as a sum of a principal low-rank term and a compact residual. The core methodology combines activation-aware SVD to preserve layer energy, a greedy mixed-rank allocation to meet a target compression, and layer-wise error compensation via a small residual matrix learned with a proxy dataset. Empirical results on ImageNet show substantial parameter reductions (e.g., 60% for DeiT-B) with minimal or no accuracy loss, and strong compatibility with post-training quantization. The approach offers a practical path to deploying ViTs in memory-constrained environments and potentially extends to other transformer architectures.
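
To make the memory argument concrete, the following back-of-the-envelope calculation shows how the rank of a factorization translates into parameter savings for a single linear layer. The symbols m, n, r, A, and B, and the choice of a 768 x 768 layer (the embedding width of DeiT-B), are illustrative assumptions rather than notation from the paper. The calculation ignores the residual term and biases, and a mixed-rank allocation would assign different ranks to different layers, so the numbers indicate scale only.

```latex
\begin{align*}
W &\in \mathbb{R}^{m \times n}, \qquad W \approx A B, \qquad
A \in \mathbb{R}^{m \times r},\; B \in \mathbb{R}^{r \times n} \\
\operatorname{params}(W) &= mn, \qquad \operatorname{params}(A, B) = r\,(m + n) \\
m = n = 768:\quad mn &= 589{,}824, \qquad r\,(m + n) = 1536\,r \\
1536\,r \le 0.4 \cdot 589{,}824 &\approx 235{,}930 \;\Longrightarrow\; r \le 153
\end{align*}
```

Under these assumptions, a 60% reduction on such a layer leaves room for a rank of at most 153 out of a possible 768, which is why preserving accuracy at that rank requires more than a plain truncated SVD.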

Abstract

As Vision Transformers (ViTs) increasingly set new benchmarks in computer vision, their practical deployment on inference engines is often hindered by their significant memory bandwidth and (on-chip) memory footprint requirements. This paper addresses this memory limitation by introducing an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs. The key idea is to decompose each weight tensor into a sum of two parameter-efficient tensors while minimizing the error between the product of the input activations with the original weight tensor and their product with the approximate tensor sum. This approximation is further refined by an efficient layer-wise error compensation technique that uses the gradient of the layer's output loss. Combined, these techniques avoid being trapped in a shallow local minimum early in the optimization process and strike a good balance between model compression and output accuracy. Notably, the presented method reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset, overcoming the usual accuracy degradation seen in low-rank approximations. In addition, the presented technique can compress large DeiT/ViT models to about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain. These results highlight the efficacy of our approach, presenting a viable solution for embedding ViTs in memory-constrained environments without compromising their performance.
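
The activation-aware objective described above can be illustrated with a minimal NumPy sketch. It covers only the principal low-rank term, not the second tensor of the sum or the gradient-based error compensation, and it is one standard way to solve the problem rather than the authors' implementation. The function name `activation_aware_low_rank`, the shapes, and the assumption that the calibration activations `X` have full column rank are all hypothetical choices made for this example.

```python
import numpy as np


def activation_aware_low_rank(W, X, rank):
    """Return (A, B) with A @ B approximating W under activations X.

    Assumed shapes (illustrative, not from the paper):
      W: (d_in, d_out) weight matrix
      X: (n_samples, d_in) calibration activations, full column rank
    The factors minimize ||X @ W - X @ A @ B||_F over rank-`rank` products.
    """
    # Thin QR: X = Q R with orthonormal columns in Q, so for any matrix M
    # we have ||X @ M||_F == ||R @ M||_F.
    _, R = np.linalg.qr(X, mode="reduced")

    # Best rank-r approximation of R @ W via truncated SVD (Eckart-Young).
    U, s, Vt = np.linalg.svd(R @ W, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank, :]

    # Map the left factor back to weight space: A = R^{-1} (U_r diag(s_r)).
    A = np.linalg.solve(R, U_r * s_r)
    B = Vt_r
    return A, B
```

Because the truncation is applied to `R @ W` rather than to `W` itself, directions of the weight matrix that the activations actually excite are kept preferentially. On the same calibration data, the resulting output error is never larger than that of a plain truncated SVD of `W` at the same rank.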

Paper Structure (13 sections, 16 equations, 3 figures, 4 tables, 1 algorithm)

Figures (3)

  • Figure 1: Impact of SVD-based rank reduction on energy level of different matrices in the first block of DeiT-B.
  • Figure 2: Impact of rank reduction on top-1 ImageNet accuracy.
  • Figure 3: Normalized Frobenius norm of the error at the output of the AttnProj layer in DeiT-B.