Layer-wise LoRA fine-tuning: a similarity metric approach

Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

TL;DR

This paper tackles the high cost of fine-tuning large language models with a layer-wise selection strategy for LoRA-based fine-tuning: a representation-similarity metric identifies the transformer layers that matter most for adaptation. Each layer receives an importance score based on representation dissimilarity, and only the top-N layers are updated, which substantially reduces the number of trainable parameters while preserving or improving task performance. Across encoder-only, decoder-only, and multimodal models, the method cuts trainable parameters by up to 50% with minimal or positive changes in performance on GLUE, math and coding tasks, and ScienceQA, and it also saves training time and memory. The approach is orthogonal to existing PEFT methods and can be combined with LoRA variants to further improve efficiency, suggesting a practical path toward scalable, parameter-efficient fine-tuning for large models.

Abstract

Pre-training Large Language Models (LLMs) on web-scale datasets has become fundamental for advancing general-purpose AI. Enhancing their predictive performance on downstream tasks, in contrast, typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. Compared to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address this problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible drop in predictive performance on the GLUE benchmark. On decoder-only architectures, we observe a small drop or even improvements in predictive performance on mathematical problem-solving and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
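The selection idea in the abstract (score each layer by how much it changes the internal representation, then keep only the highest-scoring layers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes linear CKA as the similarity metric (the repository name mentions CKA, but the variant is not given here), it assumes the importance score is 1 minus the CKA between a layer's input and output, and the names `linear_cka` and `select_layers` are hypothetical.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) activation matrices."""
    # Center each feature column so the metric ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

def select_layers(hidden_states, top_n):
    """Pick the top_n layers whose output is least similar to their input.

    Assumption: hidden_states[i] is the input of layer i and
    hidden_states[i + 1] is its output, each flattened to
    (n_samples, n_features).
    """
    # Assumed importance score: dissimilarity = 1 - CKA(input, output).
    scores = [
        1.0 - linear_cka(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = np.argsort(scores)[::-1]  # most dissimilar layers first
    selected = sorted(int(i) for i in order[:top_n])
    return selected, scores
```

In practice the hidden states would be collected from a forward pass over a small sample of the fine-tuning data, and LoRA modules would then be attached only to the returned layer indices.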

Paper Structure (7 sections, 4 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Trade-offs between LoRA with and without our method. We fine-tune LLaMA 2-7B, Mistral-7B and Gemma-7B on MetaMathQA and evaluate on GSM8K. Notably, our method reduces the number of trainable parameters while preserving predictive performance (in this case, increasing accuracy) compared to fine-tuning all layers with LoRA modules (standard practice). We observe the same behavior across different architectures, including multimodal models.
  • Figure 2: Overview of our method. We fine-tune only the subset of layers with the lowest similarity between their input and output representations (red parts).
  • Figure 3: Simple layer selection strategies. Red rectangles represent trainable layers and blue rectangles frozen ones.
  • Figure 4: Relative performance of our method with different decoder-only models across multiple tasks. Each value is the predictive performance of fine-tuning a subset of 50% of the layers minus that of fine-tuning all layers; therefore, positive numbers indicate gains and negative numbers indicate drops.
  • Figure 5: Left: Speedup in fine-tuning $\text{RoBERTa}_{\text{base}}$ on the GLUE benchmark, calculated relative to fine-tuning with LoRA modules in all layers. Right: Mean memory allocated per task when fine-tuning $\text{RoBERTa}_{\text{base}}$ on the GLUE benchmark. We measure the maximum memory usage for each training step.
  • ...and 1 more figure
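
The parameter trade-off described for Figure 1 can be made concrete with a back-of-the-envelope count. The sketch below assumes a hypothetical 7B-style configuration (32 layers, hidden size 4096, rank 8, LoRA on two square projections per layer); these values and the helper `lora_trainable_params` are illustrative assumptions, not figures taken from the paper. A LoRA module on a weight matrix of shape (d_out, d_in) adds rank × (d_in + d_out) trainable parameters.

```python
def lora_trainable_params(num_layers, rank, shapes):
    """Count LoRA parameters when every selected layer adapts the same matrices.

    Each adapted (d_out, d_in) matrix contributes rank * (d_in + d_out)
    parameters: one (rank, d_in) factor and one (d_out, rank) factor.
    """
    per_layer = sum(rank * (d_in + d_out) for d_out, d_in in shapes)
    return num_layers * per_layer

# Hypothetical configuration, chosen only for illustration.
hidden = 4096
shapes = [(hidden, hidden), (hidden, hidden)]  # e.g. two square projections

all_layers = lora_trainable_params(32, 8, shapes)   # LoRA in every layer
half_layers = lora_trainable_params(16, 8, shapes)  # LoRA in a 50% subset
reduction = 1 - half_layers / all_layers
```

Because every layer carries the same adapter shapes in this setup, selecting half of the layers halves the trainable parameters exactly; the count does not depend on which layers are chosen, so the similarity metric only affects predictive performance.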