Training Neural Networks from Scratch with Parallel Low-Rank Adapters
Minyoung Huh, Brian Cheung, Jeremy Bernstein, Phillip Isola, Pulkit Agrawal
TL;DR
The paper investigates training neural networks from scratch under resource constraints using low-rank adapters (LoRA) and introduces LoRA-the-Explorer (LTE), a bi-level optimization algorithm that trains $N$ LoRA heads in parallel and merges them periodically. Vanilla LoRA struggles to match full pre-training performance because its updates are rank-limited. LTE compensates by aggregating diverse, parallel low-rank updates to approximate a full-rank update while reducing synchronization and memory load. The effective weight is $\hat{W} = W + \frac{s}{N} \sum_{n=1}^N B_n A_n$, where $s$ is the LoRA scaling factor, and the merge rule averages the per-head updates: $\Delta_{lora} = \frac{1}{N} \sum_{n=1}^N \delta_{lora_n}$. In experiments on Vision Transformers across datasets including ImageNet-1K, LTE achieves competitive pre-training performance with fewer inter-node communications and makes larger models trainable on low-memory devices, at the cost of more training samples (e.g., ~40% more). Ablations show that the number of LoRA heads $N$, the rank $r$, and the merge interval $T$ critically influence convergence and final accuracy. The work also analyzes gradient noise, LoRA head alignment, initialization strategies, and connections to federated learning and linear-mode connectivity, and it outlines practical open questions such as accelerating the final training phase and selecting heads and ranks dynamically for scalable pre-training in constrained environments.
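The parametrization $\hat{W} = W + \frac{s}{N} \sum_{n=1}^N B_n A_n$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`summed_update`, `effective_weight`, `merge_heads`), the toy dimensions, and the choice to zero each $B_n$ after merging are assumptions made for illustration, and the per-head training loop is omitted. The sketch shows that summing $N$ rank-$r$ products can yield an update of rank higher than $r$, and that folding the averaged update into $W$ leaves the effective weight unchanged.

```python
import numpy as np

def summed_update(heads):
    """Sum of the low-rank products B_n @ A_n over all heads."""
    return sum(B @ A for B, A in heads)

def effective_weight(W, heads, s):
    """W_hat = W + (s / N) * sum_n B_n @ A_n."""
    return W + (s / len(heads)) * summed_update(heads)

def merge_heads(W, heads, s):
    """Fold the averaged low-rank update into W.

    Zeroing each B_n afterwards is an assumption of this sketch, chosen so
    that the effective weight is identical before and after the merge.
    """
    W_merged = effective_weight(W, heads, s)
    reset_heads = [(np.zeros_like(B), A) for B, A in heads]
    return W_merged, reset_heads

# Toy sizes (hypothetical): a 16x12 weight, N = 4 heads of rank r = 2.
rng = np.random.default_rng(0)
d_out, d_in, r, N, s = 16, 12, 2, 4, 1.0
W = rng.normal(size=(d_out, d_in))
heads = [
    (rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in)))
    for _ in range(N)
]

W_hat = effective_weight(W, heads, s)
W_new, heads_after = merge_heads(W, heads, s)

# A single head contributes at most rank r; the sum can reach up to N * r.
single_rank = np.linalg.matrix_rank(heads[0][0] @ heads[0][1])
merged_rank = np.linalg.matrix_rank(summed_update(heads))
print(single_rank, merged_rank)
```

In actual training the heads would be updated independently for $T$ steps between merges; here they are random matrices, which is enough to show the rank argument.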
Abstract
The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, their application in model pre-training remains largely unexplored. This paper explores extending LoRA to model pre-training, identifying the inherent constraints and limitations of standard LoRA in this context. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Our approach includes extensive experimentation on vision transformers using various vision datasets, demonstrating that LTE is competitive with standard pre-training.
