ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting
Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen
TL;DR
The paper addresses the challenge that existing Chain-of-Thought prompts for complex reasoning are often low-quality and inconsistent. It introduces CoTGenius, a data-centric framework that evolves CoT prompts using three strategies (complicate, diversify, and specify) and filters the results with evolutionary success judgement and correctness verification, yielding 44,335 prompts. ChainLM is then created by fine-tuning Llama 2-Chat 7B/13B on this improved CoT data, and is paired with a step-level debating mechanism that mitigates cumulative errors in intermediate reasoning steps. Across nine complex reasoning benchmarks, ChainLM substantially outperforms many open-source baselines and approaches the performance of some closed-source models, with ablation analyses confirming the importance of data composition and the debating strategy. The work provides a practical, scalable pathway to enhance open-source LLM reasoning and releases the dataset and code for community use.
Abstract
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs), establishing itself as a primary approach to solving complex reasoning tasks. Existing CoT synthesis approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. In response to this challenge, we present an empirical investigation of CoT prompting and introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts. CoTGenius is developed based on three major evolution strategies (complicate, diversify, and specify) alongside two filtering mechanisms: evolutionary success judgement and correctness verification. We further employ CoTGenius to create an extensive CoT dataset, and subsequently fine-tune the Llama 2-Chat 7B and 13B models on this dataset. We call the resulting model ChainLM. To deal with the cumulative error issue in reasoning steps, we propose a step-level debating method, wherein multiple debaters discuss each reasoning step to arrive at the correct answer. Extensive experiments demonstrate that our ChainLM models exhibit enhanced proficiency in addressing a spectrum of complex reasoning problems compared to existing models. In addition, we conduct an in-depth analysis of the impact of data categories within CoTGenius on the model performance. We release our dataset and code at https://github.com/RUCAIBox/ChainLM.
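
The evolve-then-filter idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `evolve_and_filter`, its parameters, and the toy string-based evolvers and checks below are all hypothetical stand-ins. In the actual framework the evolution strategies and both filters would presumably be carried out by an LLM; here they are plain Python callables so the control flow is runnable on its own.

```python
from typing import Callable, Iterable, List

# Hypothetical type aliases for illustration only.
Evolver = Callable[[str], str]       # rewrites a prompt (complicate, diversify, specify)
Judge = Callable[[str, str], bool]   # (original, evolved) -> did the evolution succeed?
Verifier = Callable[[str], bool]     # evolved prompt -> does it pass the correctness check?


def evolve_and_filter(
    seeds: Iterable[str],
    evolvers: List[Evolver],
    evolution_succeeded: Judge,
    is_correct: Verifier,
    rounds: int = 1,
) -> List[str]:
    """Apply each evolver to each prompt and keep only candidates that pass both filters."""
    kept: List[str] = []
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier: List[str] = []
        for prompt in frontier:
            for evolve in evolvers:
                candidate = evolve(prompt)
                # Filter 1 (assumed form): discard candidates where evolution failed.
                if not evolution_succeeded(prompt, candidate):
                    continue
                # Filter 2 (assumed form): discard candidates that fail verification.
                if not is_correct(candidate):
                    continue
                kept.append(candidate)
                next_frontier.append(candidate)
        frontier = next_frontier  # survivors seed the next round
    return kept


# Toy stand-ins; real strategies would be LLM calls, not string edits.
def complicate(prompt: str) -> str:
    return prompt + " with an extra constraint"


def diversify(prompt: str) -> str:
    return prompt.replace("apples", "pears")


def specify(prompt: str) -> str:
    return prompt  # deliberately a no-op so the first filter rejects it


def changed(original: str, evolved: str) -> bool:
    return evolved != original


def short_enough(evolved: str) -> bool:
    return len(evolved) < 200


if __name__ == "__main__":
    result = evolve_and_filter(
        ["Count the apples"], [complicate, diversify, specify], changed, short_enough
    )
    for prompt in result:
        print(prompt)
```

In this toy run the no-op `specify` candidate is rejected by the first filter, so only the two modified prompts survive. Setting `rounds` above 1 feeds the survivors back in as new seeds, which is one plausible reading of "evolution" but is an assumption of this sketch.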
