A Note on LoRA

Vlad Fomenko; Han Yu; Jongho Lee; Stanley Hsieh; Weizhu Chen

TL;DR

LoRA fine-tunes a model through rank-$r$ adapters: two small matrices $A \in \mathbb{R}^{d\times r}$ and $B \in \mathbb{R}^{r\times d}$ whose product $AB$ is added to a base projection, so fine-tuning stays efficient and the full weights are never updated. This note clarifies the design rationale (width-wise, parallel updates) and gives practical deployment guidance, including progressive placement of adapters in Transformers and scalable inference strategies. It also covers production concerns such as checkpointing, non-merged versus merged deployment, memory considerations, and batched multi-model serving, and it outlines future directions such as adaptive ranks and quantization-aware training. The aim is stable, scalable LoRA-based fine-tuning in large-scale systems.
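The update described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the sizes `d = 6` and `r = 2`, the random values, the row-vector convention `x @ W`, and the helper names `matmul`, `add`, and `rand_matrix` are all assumptions made for illustration. It shows that applying the adapter alongside the base projection (non-merged) and folding $AB$ into the base weight (merged) give the same output, and that the adapter holds $2dr$ parameters instead of $d^2$.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (illustrative values only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, r = 6, 2  # assumed toy sizes; in practice r is much smaller than d

W = rand_matrix(d, d, rng)  # base projection, kept frozen
A = rand_matrix(d, r, rng)  # adapter factor, d x r
B = rand_matrix(r, d, rng)  # adapter factor, r x d
x = rand_matrix(1, d, rng)  # a single input as a 1 x d row vector

# Non-merged: the adapter runs in parallel with the base projection.
y_non_merged = add(matmul(x, W), matmul(matmul(x, A), B))

# Merged: fold the low-rank product into the base weight once.
W_merged = add(W, matmul(A, B))
y_merged = matmul(x, W_merged)

# The two paths agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_non_merged[0], y_merged[0]))

full_params = d * d          # parameters in a full update of W
lora_params = d * r + r * d  # parameters in A and B together
```

Keeping `A` and `B` separate costs two extra small matrix products per call but leaves `W` untouched, which is what allows several adapters to share one base model; merging removes that overhead for a single adapter.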

Abstract

LoRA (Low-Rank Adaptation) has emerged as a preferred method for efficiently adapting Large Language Models (LLMs) with remarkable simplicity and efficacy. This note extends the original LoRA paper by offering new perspectives that were not initially discussed and presents a series of insights for deploying LoRA at scale. Without introducing new experiments, we aim to improve the understanding and application of LoRA.

Paper Structure (9 sections)

This paper contains 9 sections.

Additional Insights
On Comparison
On Motivation
On FFN
Practical Improvements
Placement
Inference
Additional Explorations
Looking Ahead
