A Note on LoRA
Vlad Fomenko, Han Yu, Jongho Lee, Stanley Hsieh, Weizhu Chen
TL;DR
LoRA expresses the update to a base projection as a rank-$r$ product $AB$, with $A \in \mathbb{R}^{d\times r}$ and $B \in \mathbb{R}^{r\times d}$, so fine-tuning trains only the two small factors instead of updating the full weight matrix. This note clarifies the design rationales (width-wise, parallel updates) and provides practical deployment guidance, including progressive placement in Transformers and scalable inference strategies. It also discusses production concerns such as checkpointing, merged versus non-merged deployment, memory considerations, and batched multi-model serving, and outlines future directions such as adaptive ranks and quantization-aware training. The goal is practical guidance for deploying LoRA-based fine-tuning stably and at scale.
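As a rough illustration of the update described above, the following NumPy sketch adds a rank-$r$ product $AB$ to a square base projection and compares the non-merged and merged ways of applying it. The sizes `d` and `r`, the row-vector convention `x @ W`, the random initial values, and the helper names `lora_forward` and `merge` are all assumptions made for this example, not details taken from the note.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hypothetical hidden size and adapter rank

# Frozen base projection (assumed square, d x d, applied as x @ W0).
W0 = rng.standard_normal((d, d))

# Low-rank factors; random values stand in for trained adapter weights.
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))


def lora_forward(x, W0, A, B):
    """Non-merged path: base output plus the low-rank branch.

    The d x d product A @ B is never materialized; the input passes
    through the r-dimensional bottleneck instead.
    """
    return x @ W0 + (x @ A) @ B


def merge(W0, A, B):
    """Merged path: fold the adapter into a single d x d weight."""
    return W0 + A @ B


x = rng.standard_normal((4, d))  # a batch of 4 input rows

y_non_merged = lora_forward(x, W0, A, B)
y_merged = x @ merge(W0, A, B)

# Parameter counts: the adapter stores 2*d*r values versus d*d for a full update.
adapter_params = A.size + B.size
full_params = W0.size
```

Both paths compute the same output up to floating-point error. The non-merged form keeps the adapter separate from the base weights, while the merged form leaves a single matrix of the original shape.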
Abstract
LoRA (Low-Rank Adaptation) has emerged as a preferred method for efficiently adapting Large Language Models (LLMs) with remarkable simplicity and efficacy. This note extends the original LoRA paper by offering new perspectives that were not initially discussed and presents a series of insights for deploying LoRA at scale. Without introducing new experiments, we aim to improve the understanding and application of LoRA.
