OTARo: Once Tuning for All Precisions toward Robust On-Device LLMs

Shaoyuan Chen, Zhixuan Chen, Dawei Yang, Zhihang Yuan, Qiang Wu

TL;DR

OTARo tackles the rigidity of fixed-precision quantization for on-device LLMs by enabling a single model to operate across multiple quantization bit-widths. It achieves this with SEFP, which shares exponents across parameter groups and uses mantissa truncation to realize different bit-widths, coupled with a training objective that optimizes losses across bit-widths. The approach introduces Exploitation-Exploration Bit-Width Path Search (BPS) to select bit-width sequences and Low-Precision Asynchronous Accumulation (LAA) to stabilize updates under low precision. Empirical results on multiple LLMs show robust performance across precisions, with substantial memory and speed benefits for on-device deployment.
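The shared-exponent idea can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the function names (`sefp_quantize`, `sefp_truncate`, `sefp_dequantize`) are hypothetical, mantissas are stored as signed integers in sign-magnitude style, the shared exponent is taken from the largest magnitude in the group, and rounding is plain truncation toward zero. It shows why a lower bit-width can be obtained from a higher one by dropping low-order mantissa bits, with no rescaling.

```python
import math


def sefp_quantize(group, mantissa_bits):
    """Quantize a group of floats to one shared exponent plus integer mantissas.

    Assumption: the shared exponent comes from the largest magnitude in the group.
    """
    max_abs = max(abs(x) for x in group)
    if max_abs == 0.0:
        return 0, [0] * len(group)
    # frexp gives max_abs = f * 2**shared_exp with 0.5 <= f < 1.
    _, shared_exp = math.frexp(max_abs)
    # Value of one mantissa unit; every |x| / step is below 2**mantissa_bits.
    step = math.ldexp(1.0, shared_exp - mantissa_bits)
    # int() truncates toward zero, i.e. low-order bits are simply dropped.
    mantissas = [int(x / step) for x in group]
    return shared_exp, mantissas


def sefp_truncate(mantissas, from_bits, to_bits):
    """Lower the bit-width by shifting out low-order mantissa bits."""
    shift = from_bits - to_bits
    # Shift the magnitude and reattach the sign (sign-magnitude truncation).
    return [(-1 if q < 0 else 1) * (abs(q) >> shift) for q in mantissas]


def sefp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Map integer mantissas back to floats using the shared exponent."""
    step = math.ldexp(1.0, shared_exp - mantissa_bits)
    return [q * step for q in mantissas]


group = [0.8125, -0.3, 0.05, 1.75]
exp8, m8 = sefp_quantize(group, 8)       # m8 == [104, -38, 6, 224]
m4_from_m8 = sefp_truncate(m8, 8, 4)     # [6, -2, 0, 14]
exp4, m4_direct = sefp_quantize(group, 4)
print(m4_from_m8 == m4_direct)           # True
print(sefp_dequantize(exp4, m4_direct, 4))  # [0.75, -0.25, 0.0, 1.75]
```

In this sketch, truncating the 8-bit mantissas to 4 bits gives exactly the same integers as quantizing directly to 4 bits, because the exponent is shared and only the mantissa length changes. That is the property that lets one stored model serve several precisions.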

Abstract

Fine-tuning techniques for Large Language Models (LLMs) not only improve adaptability to diverse downstream tasks but also mitigate the adverse effects of model quantization. Even so, conventional quantization has a structural limitation that hinders flexibility during the fine-tuning and deployment stages. Practical on-device tasks demand different quantization precisions (i.e., different bit-widths); for example, understanding tasks tend to tolerate reduced precision better than generation tasks. Conventional quantization typically relies on scaling factors that are incompatible across bit-widths, so it cannot support on-device precision switching in complex real-world scenarios. To overcome this dilemma, we propose OTARo, a novel method that enables on-device LLMs to flexibly switch quantization precisions while maintaining robust performance after a single fine-tuning run. OTARo introduces Shared Exponent Floating Point (SEFP), a distinct quantization mechanism that produces different bit-widths through simple mantissa truncations of a single model. Moreover, to achieve bit-width robustness in downstream applications, OTARo trains against the losses induced by different bit-widths. The method involves two critical strategies: (1) Exploitation-Exploration Bit-Width Path Search (BPS), which iteratively updates the search path via a designed scoring mechanism; (2) Low-Precision Asynchronous Accumulation (LAA), which performs asynchronous gradient accumulation and delayed updates under low bit-widths. Experiments on popular LLMs, e.g., LLaMA3.2-1B and LLaMA3-8B, demonstrate that OTARo achieves consistently strong and robust performance across all precisions.
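The abstract describes LAA only briefly, so the following is one possible reading rather than the paper's algorithm. It assumes that gradients computed at low bit-widths are buffered and applied as an averaged, delayed update, while gradients from higher bit-widths are applied immediately. The class name `LowPrecisionAccumulator`, the threshold `low_bit_threshold`, the count `delay_steps`, the use of plain SGD, and the choice to average rather than sum are all hypothetical.

```python
class LowPrecisionAccumulator:
    """Hypothetical sketch: delay and average updates from low bit-width steps."""

    def __init__(self, lr, low_bit_threshold, delay_steps):
        self.lr = lr
        self.low_bit_threshold = low_bit_threshold  # assumed cut-off for "low"
        self.delay_steps = delay_steps              # assumed accumulation length
        self.buffer = None
        self.count = 0

    def step(self, weights, grad, mantissa_bits):
        """Return updated weights for one training step at the given bit-width."""
        if mantissa_bits > self.low_bit_threshold:
            # Higher precision: ordinary immediate SGD update.
            return [w - self.lr * g for w, g in zip(weights, grad)]
        # Low precision: accumulate, independently of high-precision steps.
        if self.buffer is None:
            self.buffer = [0.0] * len(grad)
        self.buffer = [b + g for b, g in zip(self.buffer, grad)]
        self.count += 1
        if self.count < self.delay_steps:
            return list(weights)  # update is delayed, weights unchanged
        # Enough low-precision gradients collected: apply their mean once.
        mean_grad = [b / self.count for b in self.buffer]
        self.buffer, self.count = None, 0
        return [w - self.lr * g for w, g in zip(weights, mean_grad)]


acc = LowPrecisionAccumulator(lr=0.1, low_bit_threshold=4, delay_steps=2)
w = [1.0]
w = acc.step(w, [1.0], mantissa_bits=8)  # applied immediately
w = acc.step(w, [2.0], mantissa_bits=3)  # buffered, weights unchanged
w = acc.step(w, [4.0], mantissa_bits=3)  # mean of 2.0 and 4.0 applied
```

Under these assumptions, averaging several low-precision gradients before updating reduces the influence of any single noisy gradient, which is consistent with the stated goal of stabilizing updates under low precision.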

Paper Structure

This paper contains 24 sections, 22 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: A comparison of conventional quantization and Shared Exponent Floating Point (SEFP) in supporting dynamic precision switching. Gray arrows represent quantization, while red arrows indicate cross-precision conversion. M8, M6, and M4 denote mantissa bit-widths in SEFP.
  • Figure 2: A simple illustration of SEFP quantization (e.g., from FP16 to E5M8) of a group of parameters. Step 1 shows the exponent alignment and mantissa right-shifting for FP16 values. Step 2 shows the forced mantissa truncation.
  • Figure 3: A comparison of uniform sampling, BPS sampling, and fixed-precision fine-tuning. We report the perplexity changes of the first two approaches relative to the last one. In the experiments, LLaMA3.2-1B is fine-tuned on the WikiText2 training set and evaluated on its test set.
  • Figure 4: Cosine similarities between the gradients produced at different bit-widths for the LLaMA3.2-1B layer-15 q/k/v/down projectors. Gradients at different bit-widths are similar to a certain degree. Furthermore, the gradient at each bit-width tends to be more similar to gradients at higher bit-widths than to those at lower bit-widths. For example, for the q projector, the cosine similarity between the gradient at E5M5 and those at E5M8/E5M7/E5M6 is 0.97, whereas the similarity with E5M4 and E5M3 decreases to 0.86 and 0.72, respectively.
  • Figure 5: The gradient-norm errors $\|\nabla_{\text{sefp}}\|-\|\nabla_{\text{fp}}\|$ under different SEFP bit-widths. Gradients are calculated on the LLaMA3.2-1B layer-15 down projector.
  • ...and 4 more figures
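
The two quantities named in the captions of Figures 4 and 5, cosine similarity between gradients and the difference of gradient norms, can be computed as in the sketch below. The function names are hypothetical, gradients are assumed to be flattened into plain lists of floats, and the norm is assumed to be the L2 norm, which the captions do not state explicitly.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(grad_a, grad_b):
    """Cosine of the angle between two flattened gradients (0.0 if either is zero)."""
    norm_a, norm_b = l2_norm(grad_a), l2_norm(grad_b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(grad_a, grad_b))
    return dot / (norm_a * norm_b)


def grad_norm_error(grad_sefp, grad_fp):
    """Signed error ||grad_sefp|| - ||grad_fp||, following the Figure 5 caption."""
    return l2_norm(grad_sefp) - l2_norm(grad_fp)


# Toy gradients, not values from the paper.
g_fp = [3.0, 4.0]
g_low = [3.0, 3.0]
print(cosine_similarity(g_fp, g_low))
print(grad_norm_error(g_low, g_fp))
```

A positive error from `grad_norm_error` means the quantized gradient has a larger norm than the full-precision one, and a negative error means it is smaller.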