CEM: A Data-Efficient Method for Large Language Models to Continue Evolving From Mistakes
Haokun Zhao, Haixia Han, Jie Shi, Chengyu Du, Jiaqing Liang, Yanghua Xiao
TL;DR
The paper addresses the problem of keeping large language models (LLMs) up to date as knowledge evolves. It introduces Continue Evolving from Mistakes (CEM), a data-efficient method that collects continual pre-training (CPT) data from model mistakes and trains models with a paradigm that runs continual instruction tuning (CIT) and CPT in parallel. CEM constructs targeted CPT data from knowledge points extracted from incorrect answers, supplements them with sources such as Wikipedia and Bing, and maintains task-schema alignment through Normative, Extractive, Review, and General instructions. Empirical results across multiple open-source LLMs show substantial improvements on in-domain and out-of-domain QA benchmarks (accuracy gains of up to 29.63%) and improved stability against forgetting, especially when the Review instruction and random-replay strategies are used. The approach offers a practical pathway for the continual evolution of LLMs with reduced data and computation, and points to future work on more tasks, additional data sources, and stronger forgetting mitigation.
Abstract
As world knowledge advances and new task schemas emerge, Continual Learning (CL) becomes essential for keeping Large Language Models (LLMs) current and addressing their shortcomings. This process typically involves continual instruction tuning (CIT) and continual pre-training (CPT) to enable these models to adapt to novel tasks and acquire critical knowledge. However, collecting sufficient CPT data and efficiently bridging knowledge gaps remain significant challenges. Inspired by the 'summarizing mistakes' strategy, we propose the Continue Evolving from Mistakes (CEM) method, a data-efficient approach that collects CPT data and continually improves LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To further optimize data usage and mitigate forgetting, we introduce a novel training paradigm that combines CIT and CPT. Experiments show that CEM substantially enhances multiple models' performance on both in-domain and out-of-domain QA tasks, achieving gains of up to 29.63%. Code and datasets are available at https://anonymous.4open.science/r/cem-BB25.
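The iterative loop described in the abstract (evaluate, collect mistakes, supplement with mistake-relevant knowledge, train) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `QAItem`, `collect_mistakes`, `build_cpt_corpus`, and `cem_round` are hypothetical names, exact-match scoring stands in for whatever evaluation the paper uses, `retrieve_fn` stands in for a Wikipedia or Bing lookup, and `train_fn` stands in for the combined CIT and CPT update.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class QAItem:
    """One evaluation question with its reference answer (hypothetical type)."""
    question: str
    answer: str


def collect_mistakes(answer_fn: Callable[[str], str],
                     items: Sequence[QAItem]) -> List[QAItem]:
    """Return the items the model gets wrong.

    Assumption: case-insensitive exact match is used as the correctness check.
    """
    return [
        it for it in items
        if answer_fn(it.question).strip().lower() != it.answer.strip().lower()
    ]


def build_cpt_corpus(mistakes: Sequence[QAItem],
                     retrieve_fn: Callable[[str], Sequence[str]]) -> List[str]:
    """Gather supplementary passages for each mistaken question.

    Passages are de-duplicated while preserving first-seen order.
    """
    seen = set()
    corpus: List[str] = []
    for it in mistakes:
        for passage in retrieve_fn(it.question):
            if passage not in seen:
                seen.add(passage)
                corpus.append(passage)
    return corpus


def cem_round(answer_fn: Callable[[str], str],
              items: Sequence[QAItem],
              retrieve_fn: Callable[[str], Sequence[str]],
              train_fn: Callable[[List[str], List[QAItem]], None],
              ) -> Tuple[List[QAItem], List[str]]:
    """Run one evaluate -> supplement -> train round.

    train_fn receives both the CPT passages and the mistaken QA items,
    as a stand-in for training on CPT and CIT data together.
    It is skipped when the model makes no mistakes.
    """
    mistakes = collect_mistakes(answer_fn, items)
    corpus = build_cpt_corpus(mistakes, retrieve_fn)
    if mistakes:
        train_fn(corpus, mistakes)
    return mistakes, corpus
```

Calling `cem_round` repeatedly until it returns an empty mistake list, or until a round budget is exhausted, gives the iterative behavior the abstract describes. In this sketch only the questions the model fails on trigger retrieval, which is where the data efficiency would come from.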
