Is (Selective) Round-To-Nearest Quantization All You Need?
Alex Kogan
TL;DR
This work revisits Round-to-Nearest (RTN) quantization as a practical, data-free tool for compressing large language models. It combines RTN with selective horizontal (layer-level) and vertical (module-level) quantization and uses Marlin kernels for fast mixed-precision GEMM, achieving competitive accuracy and higher token-generation throughput than methods that depend on calibration data. Key findings: RTN-8 attains full recovery relative to FP16 baselines on many models, RTN-4 approaches full recovery on extremely large models, and pairing RTN with Marlin yields substantial latency improvements (up to ~37% on small batches), especially under a hybrid selection strategy. The results indicate that RTN, with selective quantization and optimized kernels, offers a simple, scalable, data-free alternative for deploying quantized LLMs in practice, with clear directions for automation and MoE-style extensions in future work.
Abstract
Quantization has become a necessary tool for serving ever-larger Large Language Models (LLMs). Round-to-Nearest (RTN) is perhaps the simplest quantization technique, and it was in use well before LLMs surged to the forefront of machine learning (ML) research. Yet it has been largely dismissed in favor of recent, more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel this established view, showing that RTN is not only much cheaper to apply, but can also deliver higher token-generation throughput than, and accuracy similar to, more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
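
For readers unfamiliar with RTN, the following is a minimal sketch of the idea in plain Python. It is an illustration only, not the Marlin-based implementation discussed in this work. It assumes symmetric quantization with a single scale per weight group, and the names `rtn_quantize`, `rtn_dequantize`, and `quantize_selectively`, as well as the layer names in the usage example, are hypothetical.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Assumption: one scale per group, chosen so the largest magnitude
    maps to the largest positive integer level.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() resolves exact ties to the nearest even integer.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def rtn_dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_error(weights, bits):
    """Largest absolute reconstruction error for a given bit width."""
    q, scale = rtn_quantize(weights, bits)
    restored = rtn_dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


def quantize_selectively(layers, default_bits=4, promoted=(), promoted_bits=8):
    """Quantize every named layer, using more bits for the promoted ones.

    `promoted` is a hypothetical set of layer names kept at higher precision.
    """
    return {
        name: rtn_quantize(w, promoted_bits if name in promoted else default_bits)
        for name, w in layers.items()
    }


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.89, 0.44, 0.02]
layers = {"layer0": weights, "layer1": weights}
mixed = quantize_selectively(layers, promoted={"layer1"})
```

No calibration data is involved at any point: the scale depends only on the weights themselves. Raising the bit width for selected layers shrinks the scale, and with it the worst-case rounding error (at most half the scale), at the cost of a larger footprint for those layers.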
