GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning

Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang

TL;DR

GSQ-Tuning targets private, on-device fine-tuning of large language models under strict memory and power budgets by eliminating floating-point arithmetic from both the forward and backward passes. It introduces Group-Shared Exponents Integer (GSE-INT), a memory-efficient quantization format in which each group of values shares its exponent bits, enabling fully integer-based training when combined with LoRA-like adapters and a quantize-then-compute-dequantize (QCD) pipeline. Through a Pareto-frontier analysis of bit-widths and adapter rank, the approach achieves BF16-level accuracy with up to $1.85\times$ memory savings and shows better hardware efficiency than FP8 (roughly $5\times$ less power and $11\times$ smaller chip area) at comparable performance. Experiments on LLaMA, LLaMA2, LLaMA3, and vision-language models show that the method generalizes across model families and is practical for edge deployment. The Pareto-frontier plots and ablations also give deployment guidance on how to balance quantization bit-width against low-rank adapter size so that fine-tuning fits limited-resource hardware while preserving accuracy.

Abstract

Large Language Model (LLM) fine-tuning techniques have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require large amounts of floating-point (FP) computation, raise privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce the number of trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce GSQ-Tuning, a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using exponents shared among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory- and compute-efficient. We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x. Moreover, compared to FP8, our method reduces power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.
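
The idea of sharing one exponent across a group of integers can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the group size of 32, the 8-bit signed integers, the 5-bit exponent width, and the choice of rounding the exponent up to the next power of two are all assumptions made for illustration. The sketch shows three things: quantizing a vector into per-group integers plus a per-group power-of-two exponent, accumulating a dot product in integers with the exponents simply added, and estimating the average storage cost per value.

```python
import numpy as np

def gse_quantize(x, group_size=32, bits=8):
    """Hypothetical helper: split x into groups, store signed integers
    plus one shared power-of-two exponent per group."""
    x = np.asarray(x, dtype=np.float64)
    assert x.size % group_size == 0, "length must be a multiple of group_size"
    groups = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    safe = np.where(max_abs > 0, max_abs, 1.0)      # avoid log2(0) for all-zero groups
    # Smallest exponent e with max_abs / 2**e <= qmax (assumed rounding rule).
    exp = np.ceil(np.log2(safe / qmax)).astype(np.int32)
    mant = np.clip(np.rint(groups / np.exp2(exp)), -qmax, qmax).astype(np.int32)
    return mant, exp

def gse_dequantize(mant, exp):
    """Reconstruct approximate real values: integer * 2**shared_exponent."""
    return (mant * np.exp2(exp)).reshape(-1)

def gse_dot(mant_a, exp_a, mant_b, exp_b):
    """Dot product with integer multiply-accumulate inside each group.
    Group exponents add; the final scaling uses floats here only for display."""
    acc = (mant_a.astype(np.int64) * mant_b.astype(np.int64)).sum(axis=1)
    scale = np.exp2((exp_a + exp_b).reshape(-1))
    return float((acc * scale).sum())

def bits_per_value(bits=8, exp_bits=5, group_size=32):
    """Average storage per value when exp_bits are amortized over a group."""
    return bits + exp_bits / group_size

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 128)                      # toy "weights"
mant, exp = gse_quantize(w)
w_hat = gse_dequantize(mant, exp)
print("max abs error:", np.abs(w - w_hat).max())
print("avg bits per value:", bits_per_value())
```

Under these assumptions, the per-element rounding error is bounded by half of the group's step size `2**exp`, and the shared exponent adds only `exp_bits / group_size` bits per value on top of the integer width.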

Paper Structure

This paper contains 40 sections, 8 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Within each layer, the weights have similar magnitudes. The standard deviations of the weights across layers are less than $2^{-2}$ at 3-$\sigma$ (about 99.7% probability). The weights are from Vicuna-7B-v1.5.
  • Figure 2: The GSE format is memory efficient through group-shared exponent bits. Comparison between FP8 and GSE-Int8.
  • Figure 3: Dataflow of GSQ-Tuning. The weight is NF4 in the full-rank branch and FP32 in the low-rank branch.
  • Figure 4: Pareto curve of accuracy-memory trade-offs. Compared to FP16, GSQ-Tuning reduces memory usage by 1.85$\times$ while maintaining comparable accuracy. Detailed results are in the LLaMA2-7B results table.