CEM: A Data-Efficient Method for Large Language Models to Continue Evolving From Mistakes

Haokun Zhao; Haixia Han; Jie Shi; Chengyu Du; Jiaqing Liang; Yanghua Xiao

TL;DR

The paper tackles the problem of keeping large language models up-to-date amid evolving knowledge by introducing Continue Evolving from Mistakes (CEM), a data-efficient method that collects continual pre-training data from model mistakes and trains models via a parallel CIT+CPT paradigm. CEM constructs targeted CPT data from knowledge points extracted from incorrect answers and supplements them with sources like Wikipedia and Bing, while simultaneously maintaining task-schema alignment through Normative, Extractive, Review, and General instructions. Empirical results across multiple open-source LLMs show substantial improvements on in-domain and out-of-domain QA benchmarks (up to 29.63% accuracy gains) and demonstrate improved stability against forgetting, especially when employing the Review instruction and random-replay strategies. The approach offers a practical pathway for continual evolution of LLMs with reduced data and computation, and suggests directions for extending to more tasks, additional data sources, and enhanced forgetting mitigation in future work.

Abstract

As world knowledge advances and new task schemas emerge, Continual Learning (CL) becomes essential for keeping Large Language Models (LLMs) current and addressing their shortcomings. This process typically involves continual instruction tuning (CIT) and continual pre-training (CPT) to enable these models to adapt to novel tasks and acquire critical knowledge. However, collecting sufficient CPT data and efficiently bridging knowledge gaps remain significant challenges. Inspired by the 'summarizing mistakes' strategy, we propose the Continue Evolving from Mistakes (CEM) method, a data-efficient approach aiming to collect CPT data and continually improve LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To further optimize data usage and mitigate forgetting, we introduce a novel training paradigm that combines CIT and CPT. Experiments show that CEM substantially enhances multiple models' performance on both in-domain and out-of-domain QA tasks, achieving gains of up to 29.63%. Code and datasets are available on https://anonymous.4open.science/r/cem-BB25.
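The mistake-driven data collection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `collect_supplemental_corpus`, `answer_fn`, and `retrieve_fn` are hypothetical names, exact-match scoring stands in for the paper's evaluation, and the extraction of knowledge points from incorrect answers and the combined CIT and CPT training are not modeled.

```python
from typing import Callable, Dict, List


def collect_supplemental_corpus(
    questions: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], List[str]],
) -> List[str]:
    """Gather passages only for questions the model answers incorrectly.

    answer_fn   -- hypothetical stand-in for querying the model under evaluation
    retrieve_fn -- hypothetical stand-in for a lookup against a source such as
                   Wikipedia or Bing
    """
    corpus: List[str] = []
    seen = set()
    for item in questions:
        prediction = answer_fn(item["question"])
        # Assumption: exact string match decides correctness.
        if prediction.strip() == item["answer"].strip():
            continue  # answered correctly, so no supplementation is needed
        for passage in retrieve_fn(item["question"]):
            if passage not in seen:  # skip passages already collected
                seen.add(passage)
                corpus.append(passage)
    return corpus


# Toy usage with stub functions in place of a real model and retriever.
toy_questions = [
    {"question": "Q1", "answer": "A"},
    {"question": "Q2", "answer": "B"},
]
toy_corpus = collect_supplemental_corpus(
    toy_questions,
    answer_fn=lambda q: "A",
    retrieve_fn=lambda q: ["passage about " + q],
)
```

Repeating this step after each round of training would give the iterative evaluate-then-supplement loop that the abstract describes, with the corpus shrinking as the model answers more questions correctly.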

Paper Structure (23 sections, 1 equation, 8 figures, 7 tables)

Introduction
Related Work
Methods
CEM Method
Supplemental Training Data
Experimental Setup
Training Setup
Datasets
Base Language Models
Metrics
Experiments
Main Results
Data Source Analysis
Mechanistic Analysis of Extractive and Review Instruction Components
Multiple Iterations Analysis
...and 8 more sections

Figures (8)

Figure 1: Two potential triggers for poor model performance: (1) Task Schema Unfamiliarity, and (2) Lack of Task-relevant Knowledge. Unfamiliarity with the task schema can cause deviations from expected interaction styles, while insufficient task knowledge may lead to hallucinations. Instruction tuning has been shown to be effective for addressing the former, but poor for the latter (Gekhman et al., 2024; Zhou et al., 2023).
Figure 2: The pipeline of the CEM method.
Figure 3: Accuracy of models using the CEM-P method with different data sources on the Xiezhi and CMMLU tasks. The suffixes Wiki, Bing, and Mix indicate the source of the Supplementary Corpus; Mix uses twice as many samples, drawn from the combined Wikipedia and Bing data.
Figure 4: The figure shows the ablation experimental results of three different models on the Xiezhi task after training with CEM. 'Avg of CEM-R$\alpha$' and 'Best of CEM-R$\alpha$' represent the average and best results of the CEM-R with $\alpha$ of [0, 0.5, 1].
Figure 5: The figure presents the R2W and W2R metrics of the models after CEM supplemental training. 'Avg of CEM-R$\alpha$' and 'Best of CEM-R$\alpha$' represent the average and best results of the CEM-R with $\alpha$ of [0, 0.5, 1].
...and 3 more figures
