Exploring Model Invariance with Discrete Search for Ultra-Low-Bit Quantization
Yuqiao Wen, Yanshuai Cao, Lili Mou
TL;DR
The paper tackles ultra-low-bit post-training quantization of large language models by introducing InvarExplore, a unified framework that jointly searches over invariant transformations (permutation, scaling, and rotation) within Transformer blocks. It uses discrete hill climbing guided by an activation-aware objective $\mathcal{L}$ to improve quantization quality without retraining, where $\mathcal{L}(\mathbf X, \operatorname{quant}(\bm\theta)) = \operatorname{CE}(\mathbf X, \operatorname{quant}(\bm\theta)) + \alpha \operatorname{MSE}(\mathbf H, \mathbf H_0)$. The method yields add-on improvements over state-of-the-art baselines (GPTQ, AWQ, OmniQuant) on language modeling and reasoning tasks for OPT models from 1.3B to 13B parameters, particularly in the challenging 2-bit setting, while keeping calibration overhead low. Ablation studies show that combining permutation, scaling, and rotation yields synergistic gains, although rotation is only an approximate invariance under non-linear activations. The framework is compatible with existing quantization approaches and offers practical memory savings and broader applicability to neural quantization research.
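To make the notion of an invariant transformation concrete, the following minimal sketch checks numerically that permuting or positively rescaling the hidden units of a toy feed-forward block leaves its output unchanged. The two-layer ReLU block, the helper names (`ffn`, `permute_hidden`, `scale_hidden`), and the dimensions are illustrative assumptions, not the paper's implementation; rotation is omitted because it is only approximately invariant under a non-linear activation.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(W1, W2, x):
    """Toy feed-forward block (an assumption): W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def permute_hidden(W1, W2, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    W1p = [W1[p] for p in perm]
    W2p = [[row[p] for p in perm] for row in W2]
    return W1p, W2p

def scale_hidden(W1, W2, scales):
    """Scale hidden unit j by s_j > 0 in W1 and by 1 / s_j in W2.

    This relies on relu(s * a) == s * relu(a) for s > 0.
    """
    W1s = [[s * w for w in row] for row, s in zip(W1, scales)]
    W2s = [[w / s for w, s in zip(row, scales)] for row in W2]
    return W1s, W2s

rng = random.Random(0)
d_in, d_hidden, d_out = 4, 6, 3  # hypothetical toy sizes
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

perm = list(range(d_hidden))
rng.shuffle(perm)
scales = [rng.uniform(0.5, 2.0) for _ in range(d_hidden)]

y_ref = ffn(W1, W2, x)
y_perm = ffn(*permute_hidden(W1, W2, perm), x)
y_scale = ffn(*scale_hidden(W1, W2, scales), x)

# Both differences should be zero up to floating-point rounding.
max_diff_perm = max(abs(a - b) for a, b in zip(y_ref, y_perm))
max_diff_scale = max(abs(a - b) for a, b in zip(y_ref, y_scale))
print(max_diff_perm, max_diff_scale)
```

Although the full-precision outputs agree, the transformed weights generally quantize differently, and that difference is the freedom a search over such transformations can exploit.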
Abstract
Large language models have been growing in size thanks to their success in a wide range of applications, which creates a pressing need to reduce their memory usage and make them more accessible. Post-training quantization is a popular technique that represents a model with fewer bits (e.g., 4 to 8 bits) without retraining it. However, quantization in an ultra-low-bit setup (e.g., 2 bits) remains challenging. In this paper, we propose InvarExplore, a unified framework that systematically explores different types of model invariance at the same time, allowing us to exploit the synergy among them. Importantly, InvarExplore features a discrete search algorithm that enables us to explore permutation invariance, which is under-studied because it cannot be optimized with gradient-based methods. Results show that InvarExplore is compatible with existing state-of-the-art methods and achieves add-on performance improvements over strong competing methods.
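As a rough illustration of why a discrete search suits permutation invariance, the sketch below runs a simple hill climb over hidden-unit permutations of a toy ReLU block. Several pieces are assumptions made for the example rather than details taken from the paper: the group-wise round-to-nearest quantizer (`quantize_groups`), the swap proposal, the toy dimensions, and the use of output MSE alone as a stand-in for the full objective, which also includes a cross-entropy term.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def ffn(W1, W2, x):
    """Toy feed-forward block (an assumption): W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, a) for a in matvec(W1, x)]
    return matvec(W2, hidden)

def quantize_groups(row, bits=2, group_size=4):
    """Round-to-nearest uniform quantization with one min/max range per group."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        step = (hi - lo) / levels or 1.0  # fall back to 1.0 for a flat group
        out.extend(lo + round((w - lo) / step) * step for w in group)
    return out

def quantized_loss(W1, W2, perm, calib, targets):
    """MSE between the permuted-then-quantized block and full-precision outputs."""
    W1q = [quantize_groups(W1[p]) for p in perm]
    W2q = [quantize_groups([row[p] for p in perm]) for row in W2]
    err, count = 0.0, 0
    for x, y in zip(calib, targets):
        for a, b in zip(ffn(W1q, W2q, x), y):
            err += (a - b) ** 2
            count += 1
    return err / count

def hill_climb(W1, W2, calib, steps=200, seed=0):
    """Propose random swaps of hidden units; keep a swap only if the loss drops."""
    rng = random.Random(seed)
    targets = [ffn(W1, W2, x) for x in calib]
    perm = list(range(len(W1)))
    initial = best = quantized_loss(W1, W2, perm, calib, targets)
    for _ in range(steps):
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
        loss = quantized_loss(W1, W2, perm, calib, targets)
        if loss < best:
            best = loss
        else:
            perm[i], perm[j] = perm[j], perm[i]  # revert the rejected swap
    return perm, initial, best

rng = random.Random(1)
d_in, d_hidden, d_out = 4, 8, 4  # hypothetical toy sizes
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
calib = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(16)]

perm, initial_loss, final_loss = hill_climb(W1, W2, calib)
print(initial_loss, final_loss)
```

Because a swap is kept only when the loss decreases, the final loss can never exceed the initial one, and no gradient with respect to the permutation is needed. In this toy setup the permutation matters only because the quantizer shares one range per group of adjacent weights, so reordering changes which weights are grouped together.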
