LoQT: Low-Rank Adapters for Quantized Pretraining

Sebastian Loeschcke; Mads Toftrup; Michael J. Kastoryano; Serge Belongie; Vésteinn Snæbjarnarson

TL;DR

LoQT (Low-Rank Adapters for Quantized Training) is a method for efficiently training quantized models. It uses gradient-based tensor factorization to initialize low-rank trainable weight matrices, which are periodically merged into quantized full-rank weight matrices.

Abstract

Despite advances using low-rank adapters and quantization, pretraining of large models on consumer hardware has not been possible without model sharding, offloading during training, or per-layer gradient updates. To address these limitations, we propose Low-Rank Adapters for Quantized Training (LoQT), a method for efficiently training quantized models. LoQT uses gradient-based tensor factorization to initialize low-rank trainable weight matrices that are periodically merged into quantized full-rank weight matrices. Our approach is suitable for both pretraining and fine-tuning models. We demonstrate this for language modeling and downstream task adaptation, finding that LoQT enables efficient training of models up to 7B parameters on a 24GB GPU. We also demonstrate the feasibility of training a 13B model using per-layer gradient updates on the same hardware.
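The loop described in the abstract (initialize low-rank factors from the gradient, train them, merge them into the quantized weights) can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names `quantize`, `loss_and_grad`, and `loqt_sketch` are hypothetical, the quantizer is a simple uniform one, the optimizer is plain gradient descent on a linear least-squares problem, and the error-compensation step mentioned elsewhere in the paper is omitted.

```python
import numpy as np


def quantize(w, bits=8):
    """Toy symmetric uniform quantizer; returns the dequantized values.

    Assumption: a stand-in for whatever quantization format is used in practice.
    """
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / levels
    if scale == 0.0:
        return w.copy()
    return np.round(w / scale) * scale


def loss_and_grad(w, x, y):
    """Mean squared error of y ~ w @ x and its gradient with respect to w."""
    err = w @ x - y
    n_samples = x.shape[1]
    loss = 0.5 * np.sum(err ** 2) / n_samples
    return loss, err @ x.T / n_samples


def loqt_sketch(w, x, y, rank=2, intervals=(20, 30, 45), lr=0.05, bits=8):
    """Hypothetical toy version of the init / train / merge cycle."""
    w_q = quantize(w, bits)  # frozen, quantized full-rank weights
    losses = []
    for steps in intervals:
        # (1) Initialize P from the gradient of the dequantized weights
        #     (top singular vectors); B starts at zero.
        _, grad_w = loss_and_grad(w_q, x, y)
        u, _, _ = np.linalg.svd(grad_w, full_matrices=False)
        p_q = quantize(u[:, :rank], bits)
        b = np.zeros((rank, w.shape[1]))
        # (2) Train only B while W_q and P_q stay frozen.
        for _ in range(steps):
            loss, grad_full = loss_and_grad(w_q + p_q @ b, x, y)
            losses.append(loss)
            b -= lr * (p_q.T @ grad_full)  # chain rule: dL/dB = P^T dL/dW
        # (3) Merge the low-rank update and re-quantize.
        w_q = quantize(w_q + p_q @ b, bits)
    return w_q, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 64))
    y = rng.normal(size=(6, 8)) @ x
    w0 = 0.1 * rng.normal(size=(6, 8))
    _, losses = loqt_sketch(w0, x, y)
    print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Only `b` receives gradient updates in this sketch, so the trainable state per interval is a `rank x n` matrix rather than the full `m x n` weight matrix, which is the source of the memory savings the abstract refers to.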

Paper Structure (46 sections, 6 equations, 9 figures, 9 tables, 2 algorithms)

Introduction
Efficient Pretraining With LoQT
Background: GaLore
Low-rank Gradients as Adapters
Equivalence of Gradient Updates
Pretraining with LoRA
Quantized Training
Compensating for Quantization Errors
Experiments
Baselines
Pretraining of Generative Language Models
Memory-Efficient Finetuning
Arithmetic Reasoning on GSM8K
Continued Pretraining of Llama 7B
Memory and Throughput
...and 31 more sections

Figures (9)

Figure 1: Memory usage of Llama 13B, rank 1024. LW: per-layer gradient updates. A8bit: Adam 8bit.
Figure 2: Overview of LoQT. (1) Low-rank factors $P$ and $B$ are periodically initialized from the gradient of the dequantized model weights $\nabla W$, (2) then only $B$ is trained while $P_q$ and $W_q$ are kept quantized and frozen, over an exponentially increasing interval until $T_i$, (3) the low-rank factors are merged back into the quantized model. The process is repeated until training halts.
Figure 3: LoQT: Low Rank Adapters for Quantized Training
Figure 4: Ablation results for update intervals, error compensation, and quantization, using a 130M-parameter model and rank $256$. $W_q$: quantized $W$; $P_q$: quantized $P$; No Q: no quantization. The dynamic update interval $100+1.2^i$ grows exponentially with the update index $i\in\mathbb{N}$.
Figure 5: Rank ablation for LoQT and LoQT-nq showing perplexity as a function of steps.
...and 4 more figures
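The dynamic update interval $100+1.2^i$ from the Figure 4 caption can be turned into a list of merge steps with a few lines of Python. This is a sketch under stated assumptions: the helper name `update_steps` is hypothetical, the index $i$ is assumed to start at 0, and each interval is truncated to an integer number of steps; the source does not specify either choice.

```python
def update_steps(total_steps, base=100, growth=1.2):
    """Return the training steps at which the low-rank factors would be merged.

    Assumptions: i starts at 0 and each interval base + growth**i is
    truncated to a whole number of steps.
    """
    steps = []
    t = 0
    i = 0
    while True:
        t += int(base + growth ** i)  # length of the i-th interval
        if t > total_steps:
            break
        steps.append(t)
        i += 1
    return steps


if __name__ == "__main__":
    # Early intervals are close to 100 steps; later ones grow quickly.
    print(update_steps(350))
```

Under these assumptions the early intervals are nearly constant, and the exponential term only dominates after a few dozen updates, so merges become progressively rarer as training proceeds.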

Theorems & Definitions (1)

Definition 2.1: Gradient Low-rank Projection (Def. 3.4 in Zhao et al., 2024, GaLore)
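
For orientation, the update rule that this definition refers to is usually written as follows. This is a restatement from the GaLore paper as best recalled, not text taken from this page, so the exact notation should be checked against the original definition:

$$
W_T = W_0 + \eta \sum_{t=0}^{T-1} \tilde{G}_t, \qquad \tilde{G}_t = P_t \, \rho_t\!\left(P_t^\top G_t Q_t\right) Q_t^\top ,
$$

where $\eta$ is the learning rate, $G_t$ is the gradient term at step $t$ (sign convention as in GaLore), $P_t \in \mathbb{R}^{m \times r}$ and $Q_t \in \mathbb{R}^{n \times r}$ are projection matrices, and $\rho_t$ is an entry-wise stateful gradient regularizer such as Adam.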
