Training Neural Networks from Scratch with Parallel Low-Rank Adapters

Minyoung Huh; Brian Cheung; Jeremy Bernstein; Phillip Isola; Pulkit Agrawal

TL;DR

The paper investigates training neural networks from scratch under resource constraints using low-rank adaptation (LoRA), and introduces LoRA-the-Explorer (LTE), a bi-level optimization algorithm that updates $N$ LoRA heads in parallel and merges them periodically. Vanilla LoRA struggles to match full pre-training performance because of its rank limitation; LTE compensates by aggregating diverse, parallel low-rank updates to approximate a full-rank update while reducing synchronization and memory load. This is formalized by the effective weight $\hat{W} = W + \frac{s}{N} \sum_{n=1}^N B_n A_n$ and the merge rule $\Delta_{lora} = \frac{1}{N} \sum_{n=1}^N \delta_{lora_n}$. In experiments on Vision Transformers across datasets including ImageNet-1K, LTE achieves competitive pre-training performance with fewer inter-node communications and makes larger models trainable on low-memory devices, at the cost of more training samples (e.g., ~40% more). Ablations show that the number of LoRA heads, the rank $r$, and the merge interval $T$ critically influence convergence and final accuracy. The work also analyzes gradient noise, LoRA head alignment, initialization strategies, and connections to federated learning and linear-mode connectivity, and it outlines practical open questions such as accelerating the final training phase and selecting heads and ranks dynamically for scalable pre-training in constrained environments.
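To make the effective-weight formula in the TL;DR concrete, here is a minimal sketch in plain Python (no external libraries). The helper names `matmul` and `effective_weight` and the toy matrices are illustrative assumptions, not taken from the paper. It shows that two rank-1 heads, once averaged and added to $W$, can produce a rank-2 change in the weights.

```python
def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def effective_weight(W, heads, s):
    """Return W + (s / N) * sum_n B_n A_n for a list of (B_n, A_n) pairs."""
    n_heads = len(heads)
    W_hat = [list(row) for row in W]  # copy so the caller's W is untouched
    for B, A in heads:
        BA = matmul(B, A)  # each product has rank at most r
        for i, row in enumerate(BA):
            for j, value in enumerate(row):
                W_hat[i][j] += (s / n_heads) * value
    return W_hat


# Toy example (assumed values): a 2x2 identity weight and two rank-1 heads.
W = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    ([[1.0], [0.0]], [[0.0, 2.0]]),  # B_1 (2x1), A_1 (1x2)
    ([[0.0], [1.0]], [[4.0, 0.0]]),  # B_2 (2x1), A_2 (1x2)
]
W_hat = effective_weight(W, heads, s=1.0)
print(W_hat)  # [[1.0, 1.0], [2.0, 1.0]]
```

The difference `W_hat - W` is `[[0, 1], [2, 0]]`, which has rank 2 even though each head contributes only a rank-1 term. This is the sense in which several low-rank heads can jointly approximate a higher-rank update.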

Abstract

The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, their application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training.

Paper Structure (39 sections, 18 equations, 21 figures, 6 tables, 1 algorithm)

Introduction
Principal findings and contributions:
Preliminaries
Parameter efficient adapters
Method
Motivation: Multi-head merging perspective
LoRA soup: delayed LoRA merging
LoRA-the-Explorer: parallel low-rank updates
Implementation details
Experiments
Iterative LoRA Merging
LoRA parameter alignment
Ablation study: the effect of LoRA heads, rank, and merge iteration
Gradient noise with parallel updates
Performance Scaling on ImageNet-1K
...and 24 more sections

Figures (21)

Figure 1: LoRA-the-Explorer: We propose LoRA-the-Explorer, an optimization algorithm that can match the performance of standard training from scratch. Our method optimizes unique LoRA parameters in parallel and merges them back into the main weights. Our algorithm can leverage lower-memory devices and only requires communicating the LoRA parameters during training, making it an ideal candidate for a bandwidth-limited or memory-constrained training framework.
Figure 1: LTE ablation for ViT-S trained on ImageNet100: For a fixed cumulative budget of $1200$ training epochs, we vary the number of heads, the rank, and the merge iteration of our method. More heads require more cumulative training samples to converge; see Figure \ref{fig:lte_sync_steps}.
Figure 2: Increasing the rank of LoRA can recover the standard training performance: ViT-S trained on ImageNet100 with and without LoRA. Low-rank LoRA uses rank $r=64$, and full-rank LoRA uses rank $r=\min(m, n)$, set by the dimensions of the original weight $\mathbf{W} \in \mathbb{R}^{m \times n}$. Increasing $r$ suffices to match standard training performance.
Figure 3: LTE diagram: Our method is decomposed into 3 steps. (1) We parameterize the model with multiple LoRA heads and train them independently for $T$ iterations using different mini-batches sampled from the same (homogeneous) distribution. This results in an overall update of $\delta_{\mathsf{lora}_n}(\mathbf{x}) = - \eta \sum_t \nabla_{\mathsf{lora}_n}(\mathbf{x}[t])$. (2) Next, we accumulate the individual LoRA updates by averaging the heads: $\Delta_{\mathsf{lora}}(\mathbf{x}) = \frac{1}{N}\sum_n \delta_{\mathsf{lora}_n}(\mathbf{x})$. (3) The update is applied to the main weights, and the LoRA parameter $\mathbf{B}$ is reset. The optimization then repeats with the new LoRA parameters. LTE resembles the distributed model development paradigm first proposed by Kandpal et al. (2023).
Figure 4: Effects of merging LoRA heads. Left: We measure the $\ell_2$-norm deviation of the effective weights of multi-head LoRA (MHLoRA) and LoRA-the-Explorer (LTE, our method) from the weights of standard training using ViT-S. We use $4$ heads for both MHLoRA and LTE with the same initialization, and we measure the norm at encoder-layer-3. We also plot the individual LoRA heads of LTE. These heads deviate more from standard training, but their average closely follows that of MHLoRA. Depending on the merge iteration (x-axis), the estimation gap caused by using stale estimates is roughly the difference between MHLoRA and the LTE average. The later the merge happens, the more LTE deviates from MHLoRA. Right: We project the dynamics of MHLoRA and LTE onto the parameters of MHLoRA. The y-axis corresponds to the initial parameters, and the x-axis to the parameters after training for $25$ iterations. The projection is obtained by computing the cosine similarity on the vectorized weights and creating an arc from (0, 1) to (1, 0). We set the merge iteration to $12$ and visualize how the LTE trajectory follows the arc of MHLoRA.
...and 16 more figures
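
The three steps described in the Figure 3 caption can be sketched on a toy problem. The following is a simplified, single-process illustration in plain Python, not the authors' implementation: the linear-regression task, the helper names (`matmul`, `transpose`, `axpy`, `grad_and_loss`, `lte_train`, `make_data`), and all hyperparameter values are assumptions made for this example. Each head is trained on $W + s B_n A_n$ with plain gradient descent, the heads are merged as $W \leftarrow W + \frac{s}{N}\sum_n B_n A_n$, and $B_n$ is reset to zero before the next round.

```python
import random


def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def axpy(alpha, x, y):
    """Return alpha * x + y elementwise for two equally shaped matrices."""
    return [[alpha * xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def grad_and_loss(W_eff, batch):
    """Gradient of 0.5 * mean ||W_eff x - y||^2 w.r.t. W_eff, and the loss."""
    rows, cols = len(W_eff), len(W_eff[0])
    G = [[0.0] * cols for _ in range(rows)]
    loss = 0.0
    for x, y in batch:
        pred = [sum(w * xi for w, xi in zip(row, x)) for row in W_eff]
        err = [p - t for p, t in zip(pred, y)]
        loss += 0.5 * sum(e * e for e in err)
        for i in range(rows):
            for j in range(cols):
                G[i][j] += err[i] * x[j]
    k = len(batch)
    return [[g / k for g in row] for row in G], loss / k


def lte_train(W, data, n_heads=4, rank=1, s=1.0, lr=0.1, T=5, rounds=50,
              batch_size=8, seed=0):
    rng = random.Random(seed)
    m, n = len(W), len(W[0])
    # A_n is drawn once at random; only B_n is reset at each merge (assumption).
    A = [[[rng.gauss(0.0, 0.5) for _ in range(n)] for _ in range(rank)]
         for _ in range(n_heads)]
    for _ in range(rounds):
        # Reset B_n to zero so every head starts the round at the merged W.
        B = [[[0.0] * rank for _ in range(m)] for _ in range(n_heads)]
        # Step 1: train each head independently for T iterations on its own batches.
        for h in range(n_heads):
            for _ in range(T):
                batch = rng.sample(data, batch_size)
                W_eff = axpy(s, matmul(B[h], A[h]), W)
                G, _ = grad_and_loss(W_eff, batch)
                grad_B = matmul(G, transpose(A[h]))  # dL/dB_n, up to the factor s
                grad_A = matmul(transpose(B[h]), G)  # dL/dA_n, up to the factor s
                B[h] = axpy(-lr * s, grad_B, B[h])
                A[h] = axpy(-lr * s, grad_A, A[h])
        # Steps 2 and 3: average the head updates and apply them to the main weights.
        for h in range(n_heads):
            W = axpy(s / n_heads, matmul(B[h], A[h]), W)
    return W


def make_data(W_true, n_samples=64, seed=1):
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in W_true[0]]
        y = [sum(w * xi for w, xi in zip(row, x)) for row in W_true]
        data.append((x, y))
    return data


W_true = [[1.0, -2.0, 0.5, 0.0], [0.0, 1.5, -1.0, 2.0],
          [2.0, 0.0, 1.0, -0.5], [-1.0, 0.5, 0.0, 1.0]]
data = make_data(W_true)
W0 = [[0.0] * 4 for _ in range(4)]
_, initial_loss = grad_and_loss(W0, data)
W_final = lte_train(W0, data)
_, final_loss = grad_and_loss(W_final, data)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch the heads run sequentially in one process; in a distributed setting each head would live on its own worker, and only the LoRA parameters would need to be exchanged at merge time. The optimizer, the initialization scale, and the choice to keep $A_n$ across merges are simplifications chosen for the example.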
