SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information
Kaiye Zhou, Shucheng Wang, Jun Xu
TL;DR
SwitchLoRA is a dynamic low-rank adaptation strategy that frequently swaps LoRA vectors during pre-training to better emulate full-rank updates without incurring full-parameter overhead. By maintaining candidate vectors, resetting optimizer states at swap events, and decaying the swapping frequency over time, it reaches perplexity on par with or better than full-rank training (e.g., perplexity drops from 15.23 to 15.01 on a 1.3B model) while cutting communication overhead by 54% and memory usage by about 13%. It outperforms ReLoRA and GaLore in perplexity and, after full fine-tuning, yields about a 1% average accuracy gain on GLUE over the full-rank pre-trained model. The approach combines a theoretically grounded initialization with a switching scheme that preserves optimizer stability, suggesting practical gains for large-scale distributed pre-training and 3D parallelism, where bandwidth and memory are the limiting resources.
Abstract
In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because introducing low-rank training too early significantly reduces model accuracy. Existing methods such as ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace, but they still fall short of the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of subspace updates to keep optimizer states consistent, which hinders its ability to closely approximate full-rank training behavior. GaLore, in turn, relies on Singular Value Decomposition (SVD) to approximate the full-rank space, and the approximation itself introduces accuracy loss. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, which improves accuracy by letting the updated parameters more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model, while also cutting communication overhead by 54% and memory usage by 13%. Furthermore, after full fine-tuning of both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model shows an average accuracy gain of about 1% over the full-rank pre-trained model.
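
To make the idea of replacing a few LoRA dimensions at a time more concrete, the following is a minimal plain-Python sketch, not the authors' implementation. It assumes a layer whose effective weight is W + BA, a pool of spare candidate vectors for the columns of B, and that the frozen weight W absorbs the difference at each swap so the layer output does not jump; the abstract does not spell out these details. The names `switch_column`, `switch_probability`, `moments_B`, `moments_A`, `p0`, and `half_life` are hypothetical, and both the choice of which optimizer moments to zero and the exponential decay schedule are illustrative assumptions.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def effective_weight(W, B, A):
    """Return W + B @ A, the weight the layer actually applies."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]


def switch_column(W, B, A, candidates, k, j, moments_B, moments_A):
    """Swap column k of B with candidate j, in place.

    Assumption: W absorbs (old - new) * A[k] so that W + B @ A is unchanged.
    Assumption: optimizer moments tied to rank index k are zeroed on a swap.
    """
    old = [row[k] for row in B]
    new = candidates[j]
    for i in range(len(B)):
        for c in range(len(A[k])):
            W[i][c] += (old[i] - new[i]) * A[k][c]
        B[i][k] = new[i]
        moments_B[i][k] = 0.0          # reset state of the swapped column
    moments_A[k] = [0.0] * len(A[k])   # reset state of the paired row of A
    candidates[j] = old                # the old column returns to the pool


def switch_probability(step, p0=0.5, half_life=1000):
    """Hypothetical schedule: per-step swap probability that halves every half_life steps."""
    return p0 * 0.5 ** (step / half_life)


random.seed(0)
m, r, n = 4, 2, 3  # output dim, LoRA rank, input dim
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(r)]
candidates = [[random.gauss(0, 1) for _ in range(m)] for _ in range(3)]
moments_B = [[1.0] * r for _ in range(m)]  # stand-in for Adam-style moments
moments_A = [[1.0] * n for _ in range(r)]

before = effective_weight(W, B, A)
switch_column(W, B, A, candidates, k=1, j=0, moments_B=moments_B, moments_A=moments_A)
after = effective_weight(W, B, A)

# Largest change in the effective weight; close to zero, up to rounding.
drift = max(abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra))
print(drift)
```

Because only the moments tied to a single rank index are reset, the rest of the optimizer state carries over untouched, which is what would allow swaps to happen far more often than a wholesale reset of the adapter. In a training loop, one might draw a random number each step and call `switch_column` when it falls below `switch_probability(step)`.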
