TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

TRACE presents an eight-task continual learning benchmark for aligned LLMs, spanning domain-specific, multilingual, code-generation, and mathematical reasoning tasks, with unified evaluation metrics for general ability, instruction-following, and safety. Experiments show significant declines in general ability and instruction-following after continual training, though multilingual performance can improve and reasoning-based tasks can help preserve capabilities. The paper also proposes Reasoning-augmented Continual Learning (RCL), which generates task analyses and rationales to guide training, achieving competitive target-task performance with less data and better retention of reasoning and general abilities. Overall, TRACE provides a rigorous, diverse testing ground for continual learning (CL) in LLMs and advocates leveraging intrinsic reasoning and transfer capabilities to mitigate catastrophic forgetting in practical alignment settings.

Abstract

Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. However, the continual learning aspect of these aligned LLMs has been largely overlooked. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs, owing to both their simplicity and the models' potential exposure during instruction tuning. In this paper, we introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs. TRACE consists of 8 distinct datasets spanning challenging tasks including domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. All datasets are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Our experiments show that after training on TRACE, aligned LLMs exhibit significant declines in both general ability and instruction-following capabilities. For example, the accuracy of llama2-chat 13B on the gsm8k dataset declined precipitously from 28.8% to 2% after training on our datasets. This highlights the challenge of finding a suitable tradeoff between achieving performance on specific tasks and preserving the original prowess of LLMs. Empirical findings suggest that tasks inherently equipped with reasoning paths contribute significantly to preserving certain capabilities of LLMs against potential declines. Motivated by this, we introduce the Reasoning-augmented Continual Learning (RCL) approach. RCL integrates task-specific cues with meta-rationales, effectively reducing catastrophic forgetting in LLMs while expediting convergence on novel tasks.
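The drop in general ability described above can be summarized as a single signed number. The abstract does not define such a metric, so the following is only a minimal sketch under a stated assumption: the change is taken as the mean difference between post-training and pre-training scores over a set of held-out benchmarks. The function name `ability_delta` and the dictionary layout are hypothetical; only the gsm8k figures (28.8% before, 2% after) come from the text.

```python
from statistics import mean


def ability_delta(before: dict[str, float], after: dict[str, float]) -> float:
    """Mean change in score (after - before) across benchmarks.

    Assumption: both dicts map benchmark name -> score on the same scale
    (for example, accuracy in percent). A negative result means forgetting.
    """
    if not before:
        raise ValueError("at least one benchmark score is required")
    if before.keys() != after.keys():
        raise ValueError("before and after must cover the same benchmarks")
    return mean(after[name] - before[name] for name in before)


# Figures quoted in the abstract for llama2-chat 13B on gsm8k.
baseline = {"gsm8k": 28.8}
post_training = {"gsm8k": 2.0}

delta = ability_delta(baseline, post_training)
print(f"gsm8k accuracy change: {delta:.1f} points")
```

With more benchmarks in the two dictionaries, the same function averages the per-benchmark changes, so a gain on one benchmark can partly offset a loss on another; reporting the per-benchmark differences alongside the mean avoids hiding that.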

Paper Structure

This paper contains 43 sections, 6 equations, 6 figures, and 31 tables.

Figures (6)

  • Figure 1: An overview of the TRACE benchmark. TRACE consists of two main components: 1) A selection of eight datasets constituting a tailored set of tasks for continual learning, covering challenges in domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning. 2) A post-training evaluation of LLM capabilities. In addition to traditional continual learning metrics, we introduce General Ability Delta, Instruction Following Delta, and Safety Delta to evaluate shifts in LLMs' inherent abilities.
  • Figure 2: GPT-4 evaluation with llama-13b-chat, comparing 3 different baselines (Replay, LoRA, and Sequential) to the base model across tasks including helpfulness and safety.
  • Figure 3: Performance evaluation of LLaMA-2-7B-Chat's SeqFT on the TRACE benchmark across varying sample sizes (500, 1000, 5000) and training epochs (1, 3, 5, 10; 10 epochs excluded for 5000 samples).
  • Figure 4: Evolution of LLMs' reasoning capabilities post-training on different tasks, measured using the BBH performance metric. We report the results of LLaMA-2-7B-chat and LLaMA-2-13B-chat.
  • Figure 5: An overview of the Reasoning-augmented Continual Learning (RCL) method. Our method unfolds in two stages: 1) Automatic annotation of sample reasoning paths using GPT-4. We guide GPT-4 through in-context learning and validate the generated paths via answer verification. 2) Continual learning on the reasoning-augmented dataset.
  • ...and 1 more figures