SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

Kaiye Zhou, Shucheng Wang, Jun Xu

TL;DR

SwitchLoRA is a dynamic low-rank adaptation strategy that frequently swaps LoRA vectors during pre-training to better emulate full-rank updates without the overhead of full-parameter training. By maintaining candidate vectors, resetting optimizer states at swap events, and decaying the switching frequency over time, it reaches perplexity on par with or better than full-rank training (e.g., perplexity drops from 15.23 to 15.01 on the LLaMA 1.3B model) while cutting communication overhead by up to 54% and memory usage by about 13%. It outperforms ReLoRA and GaLore in perplexity, and after full fine-tuning on GLUE the SwitchLoRA pre-trained model scores about 1% higher on average than the full-rank pre-trained model, so the efficiency gains do not come at the cost of downstream accuracy. The approach combines a theoretically grounded initialization with a switching scheme that preserves optimizer stability, which makes it a practical fit for large-scale distributed pre-training and 3D parallelism, where bandwidth and memory are the limiting resources.

Abstract

In the training of large language models, parameter-efficient techniques such as LoRA reduce memory usage and communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because introducing low-rank training too early significantly reduces model accuracy. Existing methods such as ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace, but they still fall short of the accuracy of full-rank training. Specifically, ReLoRA restricts the frequency of updates to preserve optimizer state consistency, which hinders its ability to closely approximate full-rank training behavior. Meanwhile, GaLore relies on Singular Value Decomposition (SVD) to approximate the full-rank space, which introduces accuracy loss during the approximation process. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, which improves accuracy by letting the updated parameters more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model, while also cutting communication overhead by 54% and memory usage by 13%. Furthermore, after full fine-tuning of both the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model.

Paper Structure

This paper contains 41 sections, 19 equations, 11 figures, 11 tables, and 2 algorithms.

Figures (11)

  • Figure 1: SwitchLoRA: An enhanced LoRA with dynamic vector switching for pre-training. In traditional LoRA, an adapter $\mathbf{B}\mathbf{A}$ is added to the weight matrix $\mathbf{W}$ of each linear layer. $\mathbf{B}$ and $\mathbf{A}$ are trained while $\mathbf{W}$ is kept frozen (as depicted in the left part of the figure). SwitchLoRA extends this by dynamically switching vectors within $\mathbf{B}$ and $\mathbf{A}$. The figure illustrates an example of this process: when the third column (labeled as black ③) of $\mathbf{B}$ is switched, the corresponding third row (labeled as white ③) of $\mathbf{A}$ is temporarily frozen. Similarly, when the second row (labeled as black ②) of $\mathbf{A}$ is switched, the corresponding second column (labeled as white ②) of $\mathbf{B}$ is temporarily frozen.
  • Figure 2: Loss results for 130M, 250M, and 350M models with a LoRA rank of $128$.
  • Figure 3: Loss results for 250M, 350M and 1.3B models using higher LoRA ranks.
  • Figure 4: Comparison between ReLoRA and SwitchLoRA. Red circles denote the steps at which the parameters of the LoRA adapter are reset. In the left panel, ReLoRA uses 5,000 steps of full-rank pre-training, while SwitchLoRA uses 200 steps. In the right panel, both algorithms use 1,000 steps of full-rank pre-training.
  • Figure 5: Future work roadmap.
  • ...and 6 more figures
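
The switching step described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that when column k of B is replaced by a candidate vector, the difference is folded into W so that the effective weight W + BA stays the same; that compensation is an assumption made here to illustrate "smooth" replacement and is not stated in the text above. The helper names (`switch_B_column`, `frozen_A_rows`, `freeze_steps`) and the freeze length of 5 steps are hypothetical.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def effective_weight(W, B, A):
    """Return W + B @ A, the weight the layer actually applies."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

def switch_B_column(W, B, A, candidates, k, j, frozen_A_rows, freeze_steps=5):
    """Swap column k of B with candidate j and mark row k of A as frozen."""
    old = [row[k] for row in B]
    new = candidates[j]
    # Assumption: fold (old - new) * a_k^T into W so that W + B A is unchanged.
    for i in range(len(W)):
        for c in range(len(W[0])):
            W[i][c] += (old[i] - new[i]) * A[k][c]
    for i in range(len(B)):
        B[i][k] = new[i]
    candidates[j] = old              # the replaced column goes back to the pool
    frozen_A_rows[k] = freeze_steps  # hypothetical counter: skip row k of A for this many steps

random.seed(0)
m, n, r = 4, 3, 2  # output dim, input dim, LoRA rank (toy sizes)
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(r)]
candidates = [[random.gauss(0, 1) for _ in range(m)] for _ in range(3)]
frozen_A_rows = {}

before = effective_weight(W, B, A)
switch_B_column(W, B, A, candidates, k=1, j=0, frozen_A_rows=frozen_A_rows)
after = effective_weight(W, B, A)
max_diff = max(abs(x - y) for rb, ra in zip(before, after) for x, y in zip(rb, ra))
print(max_diff < 1e-9)
```

Switching a row of A would be the mirror image: swap the row, compensate with the matching column of B, and freeze that column. A real training loop would also reset the optimizer state for the switched vector, as the TL;DR notes; that part is omitted here.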