AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning

Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi

TL;DR

This work tackles the communication and compute challenges of distributed fine-tuning for large language models by introducing Adaptive Mixed-bit Activation Quantization (AMAQ) within a split learning and PEFT framework. AMAQ dynamically assigns per-channel activation bit-widths, guided by feature and layer importance, and gradually shifts from high to low precision using trainable gating parameters with a bit-regularization loss. Empirical results across generation and classification tasks on models like LLaMA3-8B, Qwen2.5-7B/14B, and Phi-3-Medium show that AMAQ outperforms fixed-precision baselines under equal bit budgets, improves training stability, and adds only modest communication overhead, both with LoRA and with full fine-tuning, and under all-layer quantization. The findings indicate that AMAQ is a practical, scalable approach for parameter-efficient collaborative training of LLMs with minimal additional communication cost.

Abstract

Large Language Models (LLMs) are scaling rapidly, creating significant challenges for collaborative server-client distributed training, particularly in terms of communication efficiency and computational overhead. To address these challenges, we implement Parameter-efficient Split Learning, which effectively balances efficiency and performance for collaborative training on low-resource devices. To reduce communication overhead in collaborative training, we introduce Adaptive Mixed-bit Activation Quantization (AMAQ), a strategy that progressively compresses activations and gradients from high precision (6 to 8 bits) to low precision (3 to 4 bits). AMAQ achieves this by allocating bit budgets across channels based on feature-wise and layer-wise importance using bit regularization. Under the same bit budgets, AMAQ outperforms fixed-precision approaches, delivering about 2.5% higher generation accuracy and about 1.3% better classification accuracy for models like LLaMA3-8B and Qwen2.5-7B. In addition, it significantly enhances training stability and reduces ultra-low-bit representation collapse during training. Experiments demonstrate that AMAQ integrates effectively into practical multi-machine collaborative training setups, offering superior inference accuracy with only a modest communication overhead for bit adaptation during training. This trade-off makes AMAQ a practical and effective solution for collaborative training with minimal communication cost.
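
The per-channel bit allocation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the sigmoid gate, the 3-to-8-bit range, min-max uniform quantization, and the mean-bit penalty weighted by `beta` are all assumptions chosen for illustration, and the function names (`gate_to_bits`, `quantize_channel`, `bit_penalty`) are hypothetical.

```python
import math

def gate_to_bits(gate, b_min=3.0, b_max=8.0):
    """Map an unconstrained gate value to a soft bit-width in (b_min, b_max).

    Assumption: a sigmoid gate; the paper's exact gating function may differ.
    """
    return b_min + (b_max - b_min) / (1.0 + math.exp(-gate))

def quantize_channel(values, bits):
    """Min-max uniform quantization of one channel, followed by dequantization."""
    levels = 2 ** int(round(bits)) - 1   # number of quantization steps
    lo, hi = min(values), max(values)
    if hi == lo:                         # constant channel: nothing to quantize
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

def bit_penalty(gates, beta=0.1):
    """Hypothetical regularizer: beta times the mean soft bit-width.

    Added to the task loss, it would push gates toward fewer bits.
    """
    bits = [gate_to_bits(g) for g in gates]
    return beta * sum(bits) / len(bits)

# Two channels: the first gate favours high precision, the second low precision.
activations = [[0.0, 0.1, 0.5, 1.0], [-2.0, -1.0, 0.3, 2.0]]
gates = [4.0, -4.0]

dequantized = [
    quantize_channel(ch, gate_to_bits(g)) for ch, g in zip(activations, gates)
]
max_err = [
    max(abs(a - b) for a, b in zip(ch, dq))
    for ch, dq in zip(activations, dequantized)
]
```

In this sketch the first channel is quantized at 8 bits and the second at 3 bits, so the second channel's reconstruction error bound is much looser. Lowering the gates reduces `bit_penalty`, which is the kind of pressure a bit-regularization term would apply while the task loss resists it on important channels.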

Paper Structure

This paper contains 26 sections, 4 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: (a) Activation quantization pipeline in our framework, where both LoRA and the learnable quantization parameters (Q) are jointly optimized during training. (b) AMAQ deployment in a split learning setup, enabling model parallelism across multiple machines. In this setting, the model can be partitioned, and activations are compressed and decompressed using the AMAQ quantizer for efficient network communication. (c) Visualization of dequantized activations using AMAQ. Unlike prior approaches, our method incorporates a trainable bit budget controlled by a bit-regularization term, allowing dynamic control over the bit allocation of each channel during training.
  • Figure 2: Adaptive Mixed-bit Activation Quantization employs a learnable parameter to control the bit-width of each activation channel through a gating mechanism.
  • Figure 3: Evaluation of loss performance on GSM8K, MATH, and CodeAlpaca benchmarks for AMAQ, compared with BF16 and AQ-SGD across various bit-widths for both input and output activation quantization.
  • Figure 4: Performance on CodeAlpaca with different values of $\beta$.
  • Figure 5: Performance and stability of Qwen2.5-7B on BoolQ under input and output activation quantization.
  • ...and 6 more figures