Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers

Natalia Frumkin; Dibakar Gope; Diana Marculescu

TL;DR

Evol-Q uses evolutionary search to traverse the highly non-smooth loss landscape that arises when quantization scales are perturbed. It also adds an infoNCE loss, which both combats overfitting on the small calibration dataset (1,000 images) and makes such a non-smooth surface easier to traverse.

Abstract

Quantization scale and bit-width are the most important parameters when considering how to quantize a neural network. Prior work focuses on optimizing quantization scales in a global manner through gradient methods (gradient descent and Hessian analysis). Yet, when applying perturbations to quantization scales, we observe a very jagged, highly non-smooth test loss landscape. In fact, small perturbations in quantization scale can greatly affect accuracy, yielding a $0.5-0.8\%$ accuracy boost in 4-bit quantized vision transformers (ViTs). In this regime, gradient methods break down, since they cannot reliably reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to effectively traverse the non-smooth landscape. Additionally, we propose using an infoNCE loss, which not only helps combat overfitting on the small calibration dataset ($1,000$ images) but also makes traversing such a highly non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully quantized ViT-Base by $10.30\%$, $0.78\%$, and $0.15\%$ for $3$-bit, $4$-bit, and $8$-bit weight quantization levels. Extensive experiments on a variety of CNN and ViT architectures further demonstrate its robustness in extreme quantization scenarios. Our code is available at https://github.com/enyac-group/evol-q
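To make the idea of gradient-free scale search concrete, here is a minimal sketch, not the Evol-Q implementation. It assumes a single scalar scale, symmetric uniform quantization, and a plain mean-squared-error fitness in place of the paper's infoNCE loss. The names `quantize`, `mse`, `evolve_scale`, and all hyperparameters (`cycles`, `children`, `sigma`) are hypothetical.

```python
import random


def quantize(weights, scale, bits=4):
    """Uniform symmetric quantization: snap to an integer grid, clip, rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error; a stand-in fitness, NOT the paper's infoNCE loss."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def evolve_scale(weights, init_scale, bits=4, cycles=20, children=8,
                 sigma=0.02, seed=0):
    """Elitist evolutionary search over one quantization scale (assumed setup)."""
    rng = random.Random(seed)
    parent = init_scale
    parent_loss = mse(weights, quantize(weights, parent, bits))
    for _ in range(cycles):
        # Children are small multiplicative perturbations of the cycle's parent.
        candidates = [parent * (1.0 + rng.gauss(0.0, sigma))
                      for _ in range(children)]
        for cand in candidates:
            if cand <= 0.0:
                continue  # a quantization scale must stay positive
            loss = mse(weights, quantize(weights, cand, bits))
            if loss < parent_loss:
                # Keep the fittest candidate; no gradient is ever computed.
                parent, parent_loss = cand, loss
    return parent, parent_loss


# Usage on synthetic weights with a min-max style initial scale for 4 bits.
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
init_scale = max(abs(w) for w in weights) / 7
init_loss = mse(weights, quantize(weights, init_scale))
scale, loss = evolve_scale(weights, init_scale)
```

Because a candidate only replaces the parent when it lowers the loss, the search never ends worse than its starting point, and it only compares loss values. That is the property that makes this family of methods attractive on a jagged surface where gradients are unreliable.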

Paper Structure (32 sections, 6 equations, 15 figures, 10 tables, 2 algorithms)

Introduction
Related Work
The Evol-Q Framework
Uniform, End-to-End Quantization
Where to Perturb?
Global Search for Quantization Scales
The infoNCE Loss for Scale Search
Results
Setup
8-bit Quantization
4-bit Quantization
3-bit Quantization
Extending to Swin & LeViT Models
Analysis
The Test Loss Landscape of ViTs
...and 17 more sections

Figures (15)

Figure 1: We perturb along two basis vectors of one layer/block's quantization scales. The test loss landscape during perturbation is smooth in the CNN case (a), and highly non-smooth in the ViT case (b).
Figure 2: An overview of Evol-Q. On the left, we show one cycle completed on a single block. Each block has $C$ cycles of evolutionary search, and we perform $P$ passes over all blocks. On the right, we provide intuition for the infoNCE loss (Step 2), where we encourage similarity between the quantized and corresponding predictions while simultaneously maximizing dissimilarity between unlike predictions.
Figure 3: A comparison of the test loss landscapes for 4-bit quantized CNNs and ViTs. In the CNN panel, we show how small perturbations in the $4^{th}$ convolutional layer yield a smooth test loss landscape. In the ViT panel, we apply perturbations to attention block #10, and the resulting loss landscape is highly non-smooth.
Figure 4: A zoomed-in section of the ViT landscape from Figure 3, where we perform gradient descent and evolutionary search for three initial points. We show the solutions of evolutionary search ($\hat{X}_{evol}$) and gradient descent ($\hat{X}_{GD}$) after 10 iterations.
Figure 5: Loss landscapes for the 4-bit quantized QKV, Projection, and Fully Connected (FC) layers in self-attention block #5. We perturb the quantization scale along two basis vectors (Perturbation 1 & 2) to visualize the loss landscape. These landscapes capture a zoomed-in region around the global minimum of the full landscape. The FC layers exhibit relative smoothness around the global minimum, whereas the QKV & Projection layers are not easily traversable. The Projection layer is particularly difficult for gradient methods because it has 4 deep minima in close proximity to the global minimum.
...and 10 more figures
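
The contrastive intuition in the Figure 2 caption can be sketched as follows. This is an illustrative infoNCE computation, not the authors' code. It assumes the reference predictions come from the full-precision model (the caption does not state this explicitly), uses cosine similarity, and picks an arbitrary temperature. The names `cosine`, `info_nce`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two prediction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(quant_preds, ref_preds, temperature=0.1):
    """Mean infoNCE loss: pair i is the positive, all other pairs are negatives."""
    total = 0.0
    for i, q in enumerate(quant_preds):
        logits = [cosine(q, r) / temperature for r in ref_preds]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log softmax of the matching pair: small when q is close to its own
        # reference and far from the references of the other images.
        total += log_denom - logits[i]
    return total / len(quant_preds)


# Usage: quantized predictions that track their own reference score a lower
# loss than the same predictions matched to the wrong references.
ref = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = aligned[1:] + aligned[:1]
loss_aligned = info_nce(aligned, ref)
loss_shuffled = info_nce(shuffled, ref)
```

The loss rewards agreement with the matching prediction and penalizes agreement with the others in the batch, which is the "similar to like, dissimilar to unlike" behavior the caption describes.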
