OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs
Shaoyuan Chen, Zhixuan Chen, Dawei Yang, Zhihang Yuan, Qiang Wu
TL;DR
OTARo tackles the rigidity of fixed-precision quantization for on-device LLMs by enabling a single model to operate across multiple quantization bit-widths. It achieves this with SEFP, which shares exponents across parameter groups and uses mantissa truncation to realize different bit-widths, coupled with a training objective that optimizes losses across bit-widths. The approach introduces Exploitation-Exploration Bit-Width Path Search (BPS) to select bit-width sequences and Low-Precision Asynchronous Accumulation (LAA) to stabilize updates under low precision. Empirical results on multiple LLMs show robust performance across precisions, with substantial memory and speed benefits for on-device deployment.
Abstract
Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks but also mitigate the adverse effects of model quantization. Even so, conventional quantization has a structural limitation that hinders flexibility during fine-tuning and deployment. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks. Conventional quantization typically relies on scaling factors that are incompatible across bit-widths, so it cannot support on-device precision switching in complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining robust performance after fine-tuning only once. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism that produces different bit-widths through simple mantissa truncation of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo performs a learning process over the losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; and (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulation and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B and LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance across all precisions.
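To make the shared-exponent idea concrete, the following is a minimal NumPy sketch of a generic block-floating-point scheme in the spirit of SEFP. It is not the authors' implementation: the function names (`sefp_quantize`, `truncate_mantissa`, `sefp_dequantize`), the group size, the sign-magnitude mantissa layout, and round-toward-zero are all assumptions made for illustration. It shows why a lower bit-width can be obtained from a higher one by dropping low-order mantissa bits, with no re-computation of scaling factors.

```python
import numpy as np


def sefp_quantize(weights, mantissa_bits, group_size=4):
    """Illustrative only: quantize with one shared exponent per group.

    Returns integer mantissas of shape (num_groups, group_size) and shared
    exponents of shape (num_groups, 1). The weight count must be divisible
    by group_size.
    """
    w = np.asarray(weights, dtype=np.float64).reshape(-1, group_size)
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    # frexp gives e with max_abs = m * 2**e and 0.5 <= m < 1, so max_abs < 2**e.
    _, exponent = np.frexp(max_abs)
    # Scale so every magnitude fits in `mantissa_bits` bits, then truncate
    # toward zero (assumed rounding mode).
    scale = np.ldexp(1.0, mantissa_bits - exponent)
    mantissa = np.trunc(w * scale).astype(np.int64)
    return mantissa, exponent


def truncate_mantissa(mantissa, from_bits, to_bits):
    """Drop low-order mantissa bits (sign-magnitude) to reach a lower bit-width."""
    shift = from_bits - to_bits
    if shift < 0:
        raise ValueError("to_bits must not exceed from_bits")
    return np.sign(mantissa) * (np.abs(mantissa) >> shift)


def sefp_dequantize(mantissa, exponent, mantissa_bits):
    """Reconstruct approximate float values from mantissas and shared exponents."""
    return np.ldexp(mantissa.astype(np.float64), exponent - mantissa_bits)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(2, 4))
    m8, e = sefp_quantize(w, mantissa_bits=8)
    # Switching precision touches only the mantissas; exponents are reused.
    m4 = truncate_mantissa(m8, from_bits=8, to_bits=4)
    print(np.abs(w - sefp_dequantize(m8, e, 8)).max())
    print(np.abs(w - sefp_dequantize(m4, e, 4)).max())
```

In this sketch, truncating an 8-bit mantissa to 4 bits yields exactly the result of quantizing directly to 4 bits, because the exponent is shared and independent of bit-width. That property is what lets a single stored model serve several precisions, in contrast to scaling-factor schemes where each bit-width needs its own scale.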
