MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization

Akshat Ramachandran; Souvik Kundu; Tushar Krishna

TL;DR

MicroScopiQ addresses the difficulty of quantizing foundational models (FMs) by combining Hessian-guided pruning with outlier-aware quantization: outliers are encoded at higher precision while the surrounding inliers use a consistent, hardware-friendly data format. It uses MX-FP for outliers and MX-INT for inliers, and introduces ReCoN, a network-on-chip (NoC) that redistributes and coordinates outlier partial sums, all implemented on a multi-precision processing-element (PE) array. The approach achieves state-of-the-art quantization accuracy across LLMs, VLMs, CNNs, and SSMs, with up to 3x faster inference and 2x lower energy than prior methods, and an average effective bit-width (EBW) as low as approximately 2.36 bits. These results point to a practical path to high-accuracy, energy-efficient FM inference on specialized accelerators, and the design can also be integrated into GPUs.

Abstract

Quantization of foundational models (FMs) is significantly more challenging than that of traditional DNNs due to the emergence of large-magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed precision, retaining outliers at high precision but compromising hardware efficiency, or quantize inliers and outliers at the same precision, improving hardware efficiency at the cost of accuracy. To address this mutual exclusivity, we propose MicroScopiQ, a novel co-design technique that leverages pruning to complement outlier-aware quantization. MicroScopiQ retains outliers at higher precision while pruning a certain fraction of the least important weights to distribute the additional outlier bits, ensuring high accuracy, aligned memory, and hardware efficiency. We design a high-throughput, low-overhead accelerator architecture composed of multi-precision INT processing elements and a network-on-chip called ReCoN that efficiently abstracts the complexity of supporting high-precision outliers. Additionally, unlike prior techniques, MicroScopiQ does not assume any locality of outlier weights, enabling applicability to a broad range of FMs. Extensive experiments across diverse quantization settings demonstrate that MicroScopiQ achieves state-of-the-art quantization accuracy, while delivering up to 3x faster inference and 2x lower energy consumption compared to existing alternatives. Code is available at: https://github.com/georgia-tech-synergy-lab/MicroScopiQ-LLM-Quantization
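The bit-budget idea in the abstract (keep outliers at higher precision, and pay for the extra bits by pruning unimportant weights) can be illustrated with a toy sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: outliers are detected with a 3-sigma rule, weight magnitude stands in for the Hessian-guided importance score, outliers get exactly twice the inlier bit-width, plain symmetric uniform quantization replaces the MX-INT and MX-FP formats, and the function names `fake_quant` and `quantize_group` are hypothetical.

```python
def fake_quant(values, bits):
    """Symmetric uniform quantization with one shared scale (requires bits >= 2).
    Returns the dequantized values so the error is easy to inspect."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_group(weights, inlier_bits=2, sigma=3.0):
    """Toy outlier-aware quantization of one weight group.

    Assumptions (not taken from the paper): 3-sigma outlier rule,
    magnitude-based pruning, outliers stored at 2 * inlier_bits.
    """
    n = len(weights)
    mean = sum(weights) / n
    std = (sum((w - mean) ** 2 for w in weights) / n) ** 0.5

    outliers = [i for i, w in enumerate(weights) if abs(w - mean) > sigma * std]
    inliers = [i for i in range(n) if i not in outliers]

    # Prune one least-important inlier per outlier; its slot funds the extra bits.
    by_magnitude = sorted(inliers, key=lambda i: abs(weights[i]))
    pruned = by_magnitude[:len(outliers)]
    kept = [i for i in inliers if i not in pruned]

    dequant = [0.0] * n  # pruned positions stay at zero
    kept_q = fake_quant([weights[i] for i in kept], inlier_bits)
    for i, v in zip(kept, kept_q):
        dequant[i] = v
    outlier_q = fake_quant([weights[i] for i in outliers], 2 * inlier_bits)
    for i, v in zip(outliers, outlier_q):
        dequant[i] = v

    total_bits = len(kept) * inlier_bits + len(outliers) * 2 * inlier_bits
    return {"outliers": outliers, "pruned": pruned,
            "dequant": dequant, "total_bits": total_bits}


# Example group of 16 weights with a single large-magnitude value at the end.
weights = [0.1, -0.2, 0.05, 0.3, -0.15, 0.2, -0.05, 0.25,
           -0.3, 0.1, 0.0, -0.1, 0.15, -0.25, 0.2, 4.0]
result = quantize_group(weights)
print(result["outliers"], result["pruned"], result["total_bits"])
```

In this example the group still occupies 16 x 2 bits in total, because each outlier's extra bits are funded by one pruned inlier. That is the sense in which memory stays aligned even though the outlier is stored at higher precision.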
