SEE: Continual Fine-tuning with Sequential Ensemble of Experts

Zhilin Wang, Yafu Li, Xiaoye Qu, Yu Cheng

TL;DR

This work tackles catastrophic forgetting during continual fine-tuning of large language models by proposing SEE, a sequential ensemble of task-specific experts that integrates routing and generation within each expert. SEE reconstructs tasks with positive and negative indicators, trains a new expert per incoming task using LoRA-based adapters, and connects experts through sequential routing that defers to a base model when no expert should respond. Empirical results on the SuperNI benchmark show SEE outperforming rehearsal-based baselines and matching or surpassing multi-task learning in both average performance and forgetting metrics, with strong generalization to out-of-distribution queries. The approach offers practical benefits like near-linear scalability, competitive latency, and robust OOD handling, suggesting a promising direction for distributed model ensembling in continual learning.

Abstract

Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data, but they still incur an unavoidable performance loss. Although training a separate expert for each task can help prevent forgetting, effectively assembling the experts remains a challenge. Some approaches use routers to assign tasks to experts, but in continual learning these routers often require retraining for optimal performance. To address these challenges, we introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled. The framework employs distributed routing, and during continual fine-tuning, SEE only requires training new experts for incoming tasks rather than retraining the entire system. Experiments reveal that SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning. It also demonstrates remarkable generalization ability, as the experts can effectively identify out-of-distribution queries, which can then be directed to a more generalized model for resolution. This work highlights the promising potential of integrating routing and response mechanisms within each expert, paving the way for the future of distributed model ensembling.
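
The routing behavior described in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation: the indicator strings `<pos>` and `<neg>`, the helper names `make_toy_expert`, `base_model`, and `route`, and the keyword-matching toy experts are all hypothetical stand-ins for LoRA-tuned experts. The sketch assumes an expert that accepts a query prefixes its answer with a positive indicator, an expert that declines emits a negative indicator, and an output with no indicator at all is treated as out-of-distribution and sent to the base model.

```python
from typing import Callable, List

# Hypothetical indicator strings; the source only mentions "special indicators".
POS = "<pos>"
NEG = "<neg>"

Expert = Callable[[str], str]


def make_toy_expert(keyword: str, answer: str) -> Expert:
    """Toy stand-in for a trained expert: accepts queries containing `keyword`."""
    def expert(query: str) -> str:
        if keyword in query:
            return f"{POS} {answer}"  # route to self and respond
        return NEG                    # decline, pass the query on
    return expert


def base_model(query: str) -> str:
    """Toy stand-in for the general-purpose base model."""
    return f"[base] {query}"


def route(query: str, experts: List[Expert], fallback: Expert = base_model) -> str:
    """Pass the query through experts in order until one accepts it."""
    for expert in experts:
        out = expert(query)
        if out.startswith(POS):
            return out[len(POS):].strip()  # expert accepted: return its answer
        if not out.startswith(NEG):
            return fallback(query)         # no indicator: assume out-of-distribution
    return fallback(query)                 # every expert declined


experts = [make_toy_expert("translate", "bonjour"), make_toy_expert("sum", "4")]
print(route("translate hello", experts))
print(route("sum 2 2", experts))
print(route("what is MMLU", experts))
```

Because each expert carries its own routing decision, adding a task in this sketch only means appending one more callable to `experts`; nothing already in the list is retrained or modified.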

Paper Structure

This paper contains 34 sections, 16 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The inference process in the SEE framework involves a query being passed sequentially through a series of experts until it matches one, which then generates a response. If an expert fails to produce a special indicator, the query is routed to the base model, which is considered to possess the best generalization ability.
  • Figure 2: Overview of the SEE Framework: The SEE framework operates in three steps when a new task is introduced: (1) Task Reconstruction: Current data are combined with sampled instances from previous tasks to guide expert routing and responses. (2) SFT: A new expert is trained using a new LoRA on the reconstructed task. (3) Inference: All experts are integrated into a MoE system through sequential routing, enabling powerful inferences by leveraging the entire system.
  • Figure 3: The ROUGE-L scores for SEE (10%), MTL, and AvgSTL across the 10 SuperNI tasks are presented in two figures, separated according to the magnitude of the values for clearer comparison.
  • Figure 4: The performance of different methods on the MMLU benchmark after continual learning of 10 SuperNI tasks. SEE(10%)-AE and SEE(1%)-AE represent the average performance of experts in SEE.
  • Figure 5: Comparison of the perplexity distribution between SEE and MTL across 10 SuperNI tasks.
  • ...and 2 more figures
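
The task reconstruction step named in Figure 2 might look roughly like the following sketch. It rests on several assumptions that the text above does not confirm: that current-task examples are labeled with a positive indicator in front of their answers, that sampled previous-task examples are labeled with a negative indicator alone, and that the percentage in "SEE (10%)" is a sampling ratio relative to the current task size. The function name `reconstruct_task`, the `ratio` parameter, the dictionary fields, and the indicator strings are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical indicator strings, not taken from the source.
POS = "<pos>"
NEG = "<neg>"

Example = Dict[str, str]


def reconstruct_task(
    current: List[Example],
    previous: List[List[Example]],
    ratio: float = 0.1,
    seed: int = 0,
) -> List[Example]:
    """Build training data for a new expert from current and earlier tasks."""
    rng = random.Random(seed)

    # Current-task examples: the expert should accept and answer them.
    positives = [
        {"input": ex["input"], "output": f"{POS} {ex['output']}"} for ex in current
    ]

    # Sampled earlier-task examples: the expert should decline them.
    pool = [ex for task in previous for ex in task]
    k = min(len(pool), max(1, round(ratio * len(current)))) if pool else 0
    negatives = [{"input": ex["input"], "output": NEG} for ex in rng.sample(pool, k)]

    data = positives + negatives
    rng.shuffle(data)
    return data


current = [{"input": f"new-{i}", "output": f"ans-{i}"} for i in range(10)]
previous = [[{"input": f"old-{t}-{i}", "output": "x"} for i in range(5)] for t in range(2)]
train_set = reconstruct_task(current, previous, ratio=0.2)
print(len(train_set))
```

Under these assumptions, the first task has no earlier data to sample from, so its expert would be trained on positive examples only.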