Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning
Erxin Yu, Jing Li, Ming Liao, Qi Zhu, Boyang Xue, Minghui Xu, Baojun Wang, Lanqing Hong, Fei Mi, Lifeng Shang
TL;DR
The paper proposes Self-Error-Instruct (SEI), a framework that improves mathematical reasoning in large language models by generalizing training data from error types rather than from individual mistakes. SEI identifies bad cases with a target model, has an instructor model analyze each error and extract error keyphrases, clusters the keyphrases into error types, and synthesizes type-specific data via a self-instruct process, followed by one-shot data selection and iterative fine-tuning. Empirical results on GSM8K, MATH, and several out-of-domain datasets show consistent improvements across multiple target models, with larger gains for some models than others, and indicate that error-type data is more data-efficient. The work demonstrates that error-type-driven data synthesis and careful data selection can enhance LLMs' mathematical reasoning and generalization. Its limitations are the cost of using a high-tier instructor model, the scope of the datasets studied, and the reliance on one-shot validation for data selection.
Abstract
Although large language models demonstrate strong performance across various domains, they still produce numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by extrapolating solely from isolated bad cases, and therefore fail to generalize the broader patterns shared across these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized, targeted training data. Specifically, we run a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, for each identified error type, we sample a few bad cases in each generation round and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. The new data is filtered through a one-shot learning process so that only the most effective examples are kept. Finally, we use the curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.
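
The pipeline described in the abstract can be sketched as a single round of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the `sei_round` signature, and the toy stand-ins for the target and instructor models are hypothetical. Keyphrase generation and clustering are collapsed into one `label_error_type` callable, the one-shot filter keeps a candidate only if using it as a single demonstration raises accuracy on a validation set, and the fine-tuning step is omitted.

```python
import random
from collections import defaultdict

# Hypothetical sketch of one SEI round. All names are illustrative;
# the model calls are passed in as plain callables.

def find_bad_cases(solve, problems):
    """Return the (question, answer) pairs the target model gets wrong."""
    return [(q, a) for q, a in problems if solve(q) != a]

def group_by_error_type(bad_cases, label_error_type):
    """Group bad cases by error type (stands in for keyphrase clustering)."""
    clusters = defaultdict(list)
    for case in bad_cases:
        clusters[label_error_type(case)].append(case)
    return dict(clusters)

def one_shot_gain(candidate, solve_with_demo, validation):
    """Accuracy change on validation when candidate is the one-shot demo."""
    base = sum(solve_with_demo(None, q) == a for q, a in validation)
    with_demo = sum(solve_with_demo(candidate, q) == a for q, a in validation)
    return with_demo - base

def sei_round(problems, solve, solve_with_demo, label_error_type,
              synthesize, validation, k=2, seed=0):
    """Find bad cases, group them, synthesize per type, keep helpful data."""
    rng = random.Random(seed)
    bad_cases = find_bad_cases(solve, problems)
    clusters = group_by_error_type(bad_cases, label_error_type)
    selected = []
    for error_type, cases in clusters.items():
        seeds = rng.sample(cases, min(k, len(cases)))  # a few bad cases per type
        for candidate in synthesize(error_type, seeds):
            if one_shot_gain(candidate, solve_with_demo, validation) > 0:
                selected.append(candidate)
    return selected  # would be used to fine-tune the target model

# Toy stand-ins (assumptions, for demonstration only).
problems = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("6/2", "3")]
validation = [("3*3", "9"), ("10-7", "3")]

def toy_solve(q):
    # Target model that fails on multiplication and subtraction.
    return {"2+2": "4", "3*3": "6", "10-7": "4", "6/2": "3"}[q]

def toy_label(case):
    return "multiplication" if "*" in case[0] else "subtraction"

def toy_synthesize(error_type, seeds):
    # Instructor stand-in: one useful and one useless example per type.
    return [(f"new {error_type} problem", "answer"),
            (f"noisy {error_type} problem", "answer")]

def toy_solve_with_demo(demo, q):
    # Only demos whose question starts with "new" fix the validation errors.
    if demo is not None and demo[0].startswith("new"):
        return {"3*3": "9", "10-7": "3"}.get(q, toy_solve(q))
    return toy_solve(q)

selected = sei_round(problems, toy_solve, toy_solve_with_demo,
                     toy_label, toy_synthesize, validation)
```

In the full framework this round would be followed by fine-tuning the target model on `selected` and repeating the loop with the updated model. Passing the models in as callables keeps the sketch independent of any particular LLM API.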
