FrameQuant: Flexible Low-Bit Quantization for Transformers

Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, Vikas Singh

TL;DR

FrameQuant introduces a post-training quantization (PTQ) scheme based on Fusion Frames (FF) that enables fractional-bit quantization of Transformer weights, notably at about two bits, with limited accuracy loss. Weights are represented in a redundant FF space and quantized there, so the method can draw on known robustness guarantees and simple dequantization to achieve competitive performance on Vision Transformers and Large Language Models. The method yields substantial storage reductions (≈85% on average) and demonstrates superior or competitive perplexity/accuracy across ImageNet and language-modeling benchmarks, with further gains as redundancy increases. These properties make it practical for deploying large models on heterogeneous hardware, and the public code leaves room for further hardware-specific optimization. FrameQuant also compares favorably against mixed-precision baselines and integrates with existing PTQ strategies through Hessian-informed optimization in FF space.

Abstract

Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, so serving such models is expensive, often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains. The code is available at https://github.com/vsingh-group/FrameQuant
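
The idea of quantizing in a redundant representation rather than in the original weight space can be illustrated with a toy sketch. It uses an ordinary tight frame in $\mathbb{R}^2$ (not the paper's Fusion Frame construction), a plain 4-level uniform quantizer on a fixed range, and a much larger redundancy than one would use in practice. All function names, the weight vector, and the range $[-1, 1]$ are assumptions made for illustration.

```python
import math

def tight_frame(k):
    """k unit vectors at angles pi*i/k; their frame operator is (k/2) * I in R^2."""
    return [(math.cos(math.pi * i / k), math.sin(math.pi * i / k)) for i in range(k)]

def analysis(frame, w):
    """Frame coefficients c_i = <w, f_i> (the redundant representation)."""
    return [fx * w[0] + fy * w[1] for fx, fy in frame]

def synthesis(frame, coeffs):
    """Invert a tight frame in R^2: w = (2/k) * sum_i c_i f_i."""
    scale = 2.0 / len(frame)
    x = scale * sum(c * fx for c, (fx, _) in zip(coeffs, frame))
    y = scale * sum(c * fy for c, (_, fy) in zip(coeffs, frame))
    return [x, y]

def quantize_2bit(values, lo, hi):
    """Snap each value to the nearest of 4 evenly spaced levels in [lo, hi]."""
    step = (hi - lo) / 3
    return [lo + round((min(max(v, lo), hi) - lo) / step) * step for v in values]

w = [0.3, -0.7]                      # illustrative weights (assumption)
frame = tight_frame(8)               # 8 coefficients for 2 weights
coeffs = analysis(frame, w)
w_exact = synthesis(frame, coeffs)   # recovers w up to rounding error
w_hat = synthesis(frame, quantize_2bit(coeffs, -1.0, 1.0))
noise = [a - b for a, b in zip(w_hat, w)]   # quantization viewed as additive noise
print(w_exact, w_hat, noise)
```

Without quantization the synthesis step returns the original weights, so any difference between `w_hat` and `w` comes only from the quantization noise added to the frame coefficients.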

Paper Structure (38 sections, 1 theorem, 24 equations, 14 figures, 15 tables, 1 algorithm)

Key Result

Theorem 9.1

(Kutyniok et al., 2009) For the model described above, the MSE in linearly estimating the signal from its noisy projections is minimized when the Fusion Frame is tight.
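
A simpler analogue of this statement can be checked numerically for ordinary unit-norm frames in $\mathbb{R}^2$. This is a sketch under assumptions that are not taken from the paper: the coefficients are $y = Fw + \epsilon$ with i.i.d. zero-mean noise of variance $\sigma^2$, and $w$ is recovered by least squares, in which case the MSE equals $\sigma^2 \, \mathrm{tr}\big((F^\top F)^{-1}\big)$. The function names and the two example frames are hypothetical.

```python
import math

def frame_operator(frame):
    """Entries (a, b, d) of the 2x2 frame operator S = sum_i f_i f_i^T = [[a, b], [b, d]]."""
    a = sum(x * x for x, _ in frame)
    b = sum(x * y for x, y in frame)
    d = sum(y * y for _, y in frame)
    return a, b, d

def least_squares_mse(frame, noise_var=1.0):
    """MSE of least-squares recovery from noisy coefficients: noise_var * trace(S^{-1})."""
    a, b, d = frame_operator(frame)
    det = a * d - b * b
    return noise_var * (a + d) / det   # for a 2x2 matrix, trace(S^{-1}) = trace(S) / det(S)

def unit_vectors(angles):
    return [(math.cos(t), math.sin(t)) for t in angles]

k = 6
tight = unit_vectors([math.pi * i / k for i in range(k)])   # evenly spread: S = (k/2) * I
clustered = unit_vectors([0.1 * i for i in range(k)])       # same norms, bunched directions

print("tight    :", least_squares_mse(tight))       # equals 4 / k up to rounding
print("clustered:", least_squares_mse(clustered))   # strictly larger
```

Both frames have the same total energy, $\mathrm{tr}(S) = k$, so the comparison isolates the effect of tightness: $\mathrm{tr}(S^{-1})$ is smallest when the eigenvalues of $S$ are equal.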

Figures (14)

  • Figure 1: Examples of tight frames for $k = 4, 5, \dots, 11$ in $\mathbb{R}^2$
  • Figure 2: Illustration of standard calculation (on top) versus the corresponding calculations in FF space (bottom)
  • Figure 3: Inference for a FrameQuant quantized model.
  • Figure 4: (a) Validation accuracies of Vision Transformers on the ImageNet-1K dataset. FrameQuant closes the gap to the full-precision model as redundancy increases. Each dot in the plot corresponds to a model from Tables 1-2 combined. (b) Distribution of weights in a ViT layer and the $2\sigma$ thresholds for clipping. The thresholding keeps most of the mass while removing outliers.
  • Figure 5: Perplexity of models from the OPT family on the WikiText2 and C4 datasets. FrameQuant performs better than all other quantization methods under consideration. The performance gap between the quantized and unquantized models also shrinks as model size increases.
  • ...and 9 more figures
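
The $2\sigma$ clipping mentioned in the Figure 4(b) caption can be sketched on synthetic data. The sketch assumes the thresholds are placed at mean ± 2 standard deviations, which the caption does not spell out. The Gaussian weights, the two injected outliers, and the function name are all made up for illustration.

```python
import random
import statistics

def two_sigma_clip(weights):
    """Clip weights to [mean - 2*std, mean + 2*std]; also return the thresholds."""
    mu = statistics.fmean(weights)
    sigma = statistics.pstdev(weights)
    lo, hi = mu - 2 * sigma, mu + 2 * sigma
    return [min(max(w, lo), hi) for w in weights], lo, hi

random.seed(0)
# Synthetic weights (assumption): standard Gaussian plus two large outliers.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)] + [8.0, -8.0]
clipped, lo, hi = two_sigma_clip(weights)

kept = sum(lo <= w <= hi for w in weights) / len(weights)   # fraction left untouched
step_minmax = (max(weights) - min(weights)) / 3             # 4-level grid over the full range
step_clipped = (hi - lo) / 3                                # 4-level grid over the clipped range
print(f"kept={kept:.3f} step_minmax={step_minmax:.3f} step_clipped={step_clipped:.3f}")
```

With only four levels available at two bits, letting a few outliers set the range makes the grid much coarser for the bulk of the weights. Clipping trades a small amount of saturation error for a finer step.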

Theorems & Definitions (5)

  • Definition 2.1: Frames
  • Definition 2.2: Fusion Frames
  • Example 1
  • Definition 2.3: Tight Fusion Frames or TFF
  • Theorem 9.1
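
For reference, the listed definitions can be written out in their standard textbook form. This is a sketch based on the general frame-theory literature rather than a quotation from the paper, so the symbols ($\varphi_i$, $W_i$, $P_i$, $v_i$, $A$, $B$) are assumptions and may not match the paper's notation.

```latex
% Frame: vectors \{\varphi_i\}_{i=1}^{k} in a Hilbert space \mathcal{H},
% with constants 0 < A \le B < \infty such that, for all x \in \mathcal{H},
A \,\|x\|^2 \;\le\; \sum_{i=1}^{k} |\langle x, \varphi_i \rangle|^2 \;\le\; B \,\|x\|^2 .

% Fusion frame: closed subspaces \{W_i\}_{i=1}^{k} with weights v_i > 0,
% where P_i is the orthogonal projection onto W_i, such that for all x \in \mathcal{H},
A \,\|x\|^2 \;\le\; \sum_{i=1}^{k} v_i^2 \,\|P_i x\|^2 \;\le\; B \,\|x\|^2 .

% Tight fusion frame (TFF): the bounds coincide, A = B, which is equivalent to
\sum_{i=1}^{k} v_i^2 \, P_i \;=\; A \, I .
```

In the tight case the reconstruction is a rescaled sum of the projections, $x = \frac{1}{A} \sum_i v_i^2 P_i x$, so no matrix inverse is needed.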