Exploring Continual Fine-Tuning for Enhancing Language Ability in Large Language Model

Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram

TL;DR

This paper studies a two-phase continual fine-tuning (CFT) process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned in Phase 2 (predominantly Language Ability) on a multilingual dataset comprising task data in new languages.

Abstract

A common challenge for the adaptability of Large Language Models (LLMs) is learning new languages over time without hampering the model's performance on languages in which it is already proficient (usually English). Continual fine-tuning (CFT) is the process of sequentially fine-tuning an LLM so that it adapts to downstream tasks with varying data distributions and time shifts. This paper focuses on the language adaptability of LLMs through CFT. We study a two-phase CFT process in which an English-only end-to-end fine-tuned LLM from Phase 1 (predominantly Task Ability) is sequentially fine-tuned in Phase 2 (predominantly Language Ability) on a multilingual dataset comprising task data in new languages. We observe that the "similarity" of Phase 2 tasks to Phase 1 tasks determines the LLM's adaptability. For similar phase-wise datasets, the LLM after Phase 2 shows no deterioration in task ability. In contrast, when the phase-wise datasets are not similar, the LLM's task ability deteriorates. We test our hypothesis on the open-source Mistral-7B and LLaMA-3-8B models with multiple phase-wise dataset pairs. To address the deterioration, we analyze tailored variants of two CFT methods: layer freezing and generative replay. Our findings demonstrate their effectiveness in enhancing the language ability of LLMs while preserving task performance, in comparison to relevant baselines.

Paper Structure

This paper contains 35 sections, 2 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparing hidden activations for Mistral-7B and LLaMA-3-8B during our two-phase continual fine-tuning process. We prompt each model with examples from MTBench [zheng2024judging] and visualize, for each model layer, the similarity between the mean hidden activations. For datasets that encode "similar" tasks (Alpaca & MultiAlpaca), the model's task ability does not decline (e.g., 3% gain for IFEval). For non-similar datasets (Instruct & MultiAlpaca), the task ability declines (e.g., 8% decline for IFEval). In the latter case, Phase 2 model representations do not align with Phase 1's, suggesting greater model weight interference and a decline in task ability.
  • Figure 2: We see a greater change in the variation of the representations for non-similar datasets (e.g., Instruct & MultiAlpaca) compared to similar datasets (e.g., Alpaca & MultiAlpaca). Interestingly, for LLaMA-3-8B the change is large across layers and an order of magnitude higher than for Mistral-7B. For Mistral-7B, the later layers show the most change.
  • Figure : (a) Mistral-7B
  • Figure C4: Visualizing Variance in Model Representations for Mistral-7B Mitigating Strategies: We see a decrease in the variance of model representations for models trained using our mitigation strategies compared to vanilla Phase 2 models (see the figure labeled fig:variance_1).
  • Figure : (a) Mistral-7B
  • ...and 1 more figure
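
The layer-wise comparison described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the hidden activations for each layer have already been extracted as plain lists of floats (one vector per prompt), and it assumes cosine similarity as the similarity measure; the caption does not name the measure used. The function names `mean_activation`, `cosine_similarity`, and `layerwise_similarity` are hypothetical and do not come from the paper.

```python
import math


def mean_activation(activations):
    """Average a list of per-prompt activation vectors into one vector."""
    n = len(activations)
    dim = len(activations[0])
    return [sum(vec[i] for vec in activations) / n for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_similarity(phase1_acts, phase2_acts):
    """Compare two models layer by layer.

    Each argument is a list over layers; each layer is a list over prompts
    of activation vectors. Returns one similarity score per layer.
    Cosine similarity is an assumption made for this sketch.
    """
    return [
        cosine_similarity(mean_activation(l1), mean_activation(l2))
        for l1, l2 in zip(phase1_acts, phase2_acts)
    ]


# Toy data: 2 layers, 2 prompts, 2-dimensional activations (illustrative only).
phase1 = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]]]
phase2 = [[[2.0, 0.0], [4.0, 0.0]], [[3.0, 0.0], [1.0, 0.0]]]
sims = layerwise_similarity(phase1, phase2)
```

Under this reading, a layer whose score stays close to 1 after Phase 2 has kept its Phase 1 representation, while a low score marks a layer whose representation has drifted.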

Theorems & Definitions (2)

  • Definition 1: Dataset Embedding Similarity (DES)
  • Definition 2: Model Parameter Difference (MPD)