Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Naimul Haque

TL;DR

The paper addresses catastrophic forgetting during sequential fine-tuning of open-source LLMs with fewer than $10$B parameters on GLUE NLU tasks (SST-2, MRPC, CoLA, MNLI). It adopts a continual instruction fine-tuning approach in which prompt engineering produces task-specific prompts $X' = PE(X)$, and the base model $M_0$ is fine-tuned sequentially, yielding $M_i$ after task $T_i$; retention is evaluated by accuracy on previous tasks. With $a_{k,t}$ denoting the accuracy on task $t$ after fine-tuning episode $k$, forgetting and learning are quantified as $\text{Forgetting} = \max_{0 \leq k \leq T} (a_{k,t}) - a_{T,t}$ and $\text{Learning} = \max_{0 \leq k \leq T} (a_{k,t}) - a_{0,t}$. Key findings show that Phi-3.5-mini minimizes forgetting while maintaining learning, and that Orca-2-7B and Qwen2.5-7B achieve strong performance after fine-tuning, with trade-offs between forgetting and learning as model size grows. The results inform continual learning for autonomous LLM-based agents and underscore the role of prompt design and fine-tuning strategies.
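The two metrics can be illustrated with a minimal Python sketch. It assumes that `acc_history[k]` holds $a_{k,t}$, the accuracy on a single task $t$ after episode $k$, with index 0 corresponding to the base model $M_0$ and the last index to the final model. The function names and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def forgetting(acc_history: Sequence[float]) -> float:
    """Peak accuracy on a task minus its accuracy after the final episode."""
    return max(acc_history) - acc_history[-1]


def learning(acc_history: Sequence[float]) -> float:
    """Peak accuracy on a task minus the base model's accuracy on it."""
    return max(acc_history) - acc_history[0]


# Hypothetical accuracies on one task t, NOT results from the paper:
# index 0 is the base model M_0, the last index is the final model M_T.
sst2_history = [0.5, 0.875, 0.75]

print(forgetting(sst2_history))  # 0.125
print(learning(sst2_history))    # 0.375
```

Under these definitions both quantities are non-negative, since the maximum is taken over all episodes, including the first and the last.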

Abstract

Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU) tasks. As we progress toward an agentic world where LLM-based agents autonomously handle specialized tasks, it becomes crucial for these models to adapt to new tasks without forgetting previously learned information, a challenge known as catastrophic forgetting. This study evaluates the continual fine-tuning of various open-source LLMs with different parameter sizes (specifically models under 10 billion parameters) on key NLU tasks from the GLUE benchmark, including SST-2, MRPC, CoLA, and MNLI. By employing prompt engineering and task-specific adjustments, we assess and compare the models' abilities to retain prior knowledge while learning new tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities, making them well-suited for continual learning environments. Additionally, models like Orca-2-7b and Qwen2.5-7B demonstrate impressive learning abilities and overall performance after fine-tuning. This work contributes to understanding catastrophic forgetting in LLMs and highlights the role of prompt engineering in optimizing model performance for continual learning scenarios.

Paper Structure

This paper contains 9 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The figure illustrates the Continual Finetuning workflow. $M_0$ represents the base language model, and subsequent models $M_1, M_2, \dots, M_n$ denote the fine-tuned versions after training on tasks $T_1, T_2, \dots, T_n$. The figure also highlights the process of generating task-specific prompts and the continual evaluation to assess the model's retention.
  • Figure 2: Performance of various models across continual fine-tuning episodes for the task SST-2. The solid blue line highlights the model with the highest overall performance, while the solid orange line represents the model with the least forgetting (the smallest drop in performance between tasks). Dashed lines indicate the performance of other models. This diagram illustrates both the learning capacity and retention ability of each model over successive tasks.
  • Figure 3: Bar graph displaying model performance for the task SST-2 on catastrophic forgetting (reverse) and learning rates, with larger models showing more pronounced trade-offs. Phi-3.5-mini stands out with minimal forgetting and moderate learning.
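
The workflow described for Figure 1 can be outlined as a short Python sketch. It assumes that every task is evaluated after every fine-tuning episode, so that an accuracy history per task is available. `continual_finetune`, `fine_tune`, and `evaluate` are hypothetical names, and the toy stand-ins in the usage example replace real training and evaluation code; none of this is the paper's implementation.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Model = TypeVar("Model")


def continual_finetune(
    base_model: Model,
    tasks: Sequence[str],
    fine_tune: Callable[[Model, str], Model],
    evaluate: Callable[[Model, str], float],
) -> Dict[str, List[float]]:
    """Fine-tune on each task in order and record accuracy on all tasks.

    history[t][k] is the accuracy on task t after episode k,
    where k = 0 is the base model M_0.
    """
    history = {t: [evaluate(base_model, t)] for t in tasks}
    model = base_model
    for task in tasks:
        model = fine_tune(model, task)  # M_{i-1} -> M_i on task T_i
        for t in tasks:  # continual evaluation after each episode
            history[t].append(evaluate(model, t))
    return history


# Toy stand-ins (hypothetical): a "model" is the set of tasks it has seen,
# and accuracy is 1.0 on seen tasks and 0.5 otherwise.
def toy_fine_tune(model: frozenset, task: str) -> frozenset:
    return model | {task}


def toy_evaluate(model: frozenset, task: str) -> float:
    return 1.0 if task in model else 0.5


history = continual_finetune(
    frozenset(), ["SST-2", "MRPC"], toy_fine_tune, toy_evaluate
)
print(history)
```

The toy model never forgets, so its accuracy histories only rise; with a real model, a drop in a task's history after later episodes is what the forgetting measure captures.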