A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios

Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, Noah Fiedel

TL;DR

This study benchmarks three LLM adaptation paradigms, In-Context Learning (ICL), Supervised Finetuning (SFT), and Low-Rank Adaptation (LoRA), in data-scarce scenarios using a single base model, Gemma-3. It finds that ICL preserves existing knowledge but struggles with complex skills, SFT acquires skills rapidly yet suffers severe catastrophic forgetting, and LoRA strikes a practical balance by enabling skill learning while largely preserving prior knowledge. The work analyzes hyperparameter effects, especially LoRA rank and data count, and shows that LoRA updates are highly layer-specific, concentrating in upper layers to minimize disruption of pre-trained representations. The findings offer actionable guidance on choosing adaptation strategies based on data availability and the importance of knowledge retention, with LoRA emerging as a robust middle ground for many real-world, data-limited tasks.

Abstract

The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce scenarios. We find that LoRA provides the most effective balance, successfully instilling new skills with minimal impact on the base model's general knowledge. In contrast, while SFT excels at skill acquisition, it is highly susceptible to catastrophic forgetting. ICL is effective for incorporating factual knowledge but struggles with complex skills. Our findings offer a practical framework for selecting an LLM adaptation strategy. We highlight the critical distinction between skill acquisition and knowledge integration and clarify the trade-offs between task-specific performance and the preservation of general capabilities.
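To make "minimizing parameter changes" concrete, the following is a minimal sketch of the commonly used LoRA formulation, in which a frozen weight matrix W is augmented by a trainable low-rank product scaled by alpha / r. It is not taken from the paper: the function names, the toy dimensions, the scaling factor `alpha`, and the 4096 x 4096 layer size are illustrative assumptions, and only the rank r = 4 mirrors the setting named in the figure captions.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with W kept frozen.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    Only A and B would be trained in this formulation.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)

random.seed(0)
d_out, d_in, r = 6, 8, 4  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha=8.0)  # alpha is an assumed value

# Hypothetical 4096 x 4096 projection, not a Gemma-3 dimension.
full, lora = trainable_params(4096, 4096, 4)
print(y_lora == y_base, full, lora, full // lora)
```

Because B starts at zero, the adapted layer initially reproduces the base model exactly, and for the hypothetical 4096 x 4096 layer the rank-4 adapter trains 512 times fewer parameters than full fine-tuning. Both properties are consistent with, though not a proof of, the reduced forgetting reported for LoRA.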

Paper Structure

This paper contains 9 sections, 11 figures.

Figures (11)

  • Figure 1: Per-task accuracy comparison of ICL, LoRA$_{r=4}$, and SFT across 13 benchmarks. Left: 64 samples; right: 128 samples. Performance is strongly task-dependent. On planning tasks (Blocksworld, Logistics), SFT and LoRA significantly outperform ICL. For NLP tagging skills (FEATS, LEMMA, UPOS, XPOS), ICL is best with 64 samples, while with 128 samples LoRA and SFT largely close the gap, leaving only minor differences. Knowledge-heavy tasks (NQ, GSM8K, GPQA) show minimal change with more shots. Overall, moving from 64 to 128 samples generally raises accuracy for skill-based tasks.
  • Figure 2: Skill acquisition comparison of ICL, LoRA$_{r=4}$, and SFT with 64 samples. We plot accuracy on the target skill, Part-of-Speech Tagging (train/test), and on a general knowledge benchmark (NQ) to measure catastrophic forgetting. Left: SFT rapidly masters the skill (high test accuracy) but suffers complete catastrophic forgetting, with NQ accuracy dropping to 0. Middle: LoRA also learns the skill effectively but, in sharp contrast to SFT, preserves general knowledge, as shown by the stable NQ accuracy. Right: ICL demonstrates partial skill acquisition at inference time (lower test accuracy) and, as it involves no weight updates, shows no knowledge degradation.
  • Figure 3: In-Context Learning performance on selected skill tasks: Universal Part-of-Speech tagging (UPOS), syntactic head prediction (Head), and Adversarial Natural Language Inference (ANLI). We selected two skills whose accuracy improves (UPOS and Head) and one task (ANLI) whose accuracy does not increase or even drops.
  • Figure 5: Supervised Fine-Tuning on UPOS results in rapid catastrophic forgetting. At learning rates of $10^{-3}$ (top) and $5 \times 10^{-4}$ (middle), the model's general abilities are lost within a few training steps. The lowest learning rate ($10^{-4}$, bottom) avoids this severe forgetting, but it also prevents the model from successfully acquiring the task.
  • Figure 6: SFT on a structured prediction task (syntactic head identification) with a learning rate of 0.001. The model exhibits both rapid catastrophic forgetting of its general pre-trained abilities and severe overfitting on the fine-tuning data. While training accuracy quickly reaches 100% even with few samples, test accuracy scales poorly, remaining near zero with 16 samples and reaching only approximately 20% with 1024 samples.
  • ...and 6 more figures