LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

TL;DR

Fine-tuning large language models is costly in both trainable parameters and memory. The authors introduce LaMDA, a spectrally decomposed low-dimensional adaptation that freezes the first projection PMA, trains a small low-dimensional adapter S between A and B, and gradually freezes the second projection PMB, reducing parameter count and activation memory; they further enhance it with LaMDA++ for adaptive layer-wise rank allocation guided by pre-trained weight spectra. Across DeBERTa-V3, BART-large, and LLaMA2-7B, LaMDA achieves similar or better task performance while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak memory, enabling efficient fine-tuning on commodity GPUs. The results demonstrate a scalable, memory-efficient path for adapting very large models to downstream tasks, with LaMDA++ offering additional gains through energy-based rank allocation.

Abstract

Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a "lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
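
To make the parameter argument in the abstract concrete, the following is a minimal sketch that counts trainable parameters for a single adapted linear module. It is not the authors' code. The function names, the choice of d = 4096 and r = 32, and the treatment of PMB as either fully trainable or fully frozen are illustrative assumptions; the abstract describes a gradual freeze, which this sketch simplifies to two end states.

```python
def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    """LoRA trains both projections: A (d_in x r) and B (r x d_out)."""
    return d_in * r + r * d_out


def lamda_trainable(d_out: int, r: int, pmb_frozen: bool) -> int:
    """LaMDA-style path (assumed shapes): PMA is frozen, the square
    adapter S (r x r) is always trained, and PMB (r x d_out) is
    trained only until it is frozen."""
    s_params = r * r
    b_params = 0 if pmb_frozen else r * d_out
    return s_params + b_params


# Illustrative sizes only (assumption): a square module with d = 4096, rank 32.
d, r = 4096, 32
print(lora_trainable(d, d, r))                  # 262144
print(lamda_trainable(d, r, pmb_frozen=False))  # 132096
print(lamda_trainable(d, r, pmb_frozen=True))   # 1024
```

The LoRA count grows linearly with the embedding dimension d, whereas the square adapter depends only on r, which is the scaling behaviour the abstract points to. The per-module ratio here is not meant to reproduce the reported 17.7x figure, which is measured over whole models and training runs.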

Paper Structure (19 sections, 12 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: (a) LoRA (Hu et al., ICLR 2022). (b) VeRA (Kopiczko et al., arXiv:2310.11454). (c) LaMDA. Initially, PMB is trainable and is gradually frozen based on the singular values. After $t_i$ iterations, PMB is completely frozen, and only the low-dimensional adapter (LDA) is fine-tuned.
  • Figure 2: GPU memory usage of LLaMA2-7B on different fine-tuning methods including ours (LaMDA).
  • Figure 3: Layer-wise energy-score of the first 32 ranks of each linear module, normalized over the total energy-score of the same module, evaluated on a pre-trained LLaMA2-7B.
  • Figure 4: Peak GPU memory usage while fine-tuning BART-large on the XSUM dataset.
  • Figure 5: Training curve of LLaMA2-7B on Wikitext-2.
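
The normalized energy-score described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes that "energy" means the sum of squared singular values, and the function name `normalized_energy` and the toy singular values are hypothetical. In practice the singular values would come from an SVD of a pre-trained weight matrix.

```python
def normalized_energy(singular_values, k=32):
    """Fraction of a module's total spectral energy held by its top-k ranks.

    Assumption: energy is the sum of squared singular values.
    """
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    if total == 0:
        return 0.0
    return sum(energies[:k]) / total


# Toy spectrum (made up for illustration): the top 2 of 4 ranks hold 13/15 of the energy.
score = normalized_energy([3.0, 2.0, 1.0, 1.0], k=2)
print(round(score, 4))  # 0.8667
```

A per-module score of this kind could feed a layer-wise rank allocation of the sort LaMDA++ is said to use; how the scores are mapped to ranks is not stated in this section.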