Enhancing Chain-of-Thoughts Prompting with Iterative Bootstrapping in Large Language Models

Jiashuo Sun, Yi Luo, Yeyun Gong, Chen Lin, Yelong Shen, Jian Guo, Nan Duan

TL;DR

This work tackles the instability of chain-of-thought prompting by addressing errors in demonstrations and the sensitivity to exemplar difficulty. It introduces Iter-CoT, an iterative bootstrapping framework that builds a self-correcting demonstration pool through initialization, iterative revision, and summarization, usable in both labeled and unlabeled settings. Across ten datasets spanning arithmetic, commonsense, and symbolic reasoning, Iter-CoT achieves state-of-the-art or competitive results, with self-consistency further boosting performance and robustness across foundation models. The approach emphasizes the value of learning from corrected mistakes, enriched contextual reasoning, and carefully chosen exemplars, while acknowledging costs in construction and evaluator reliability as limitations. Collectively, Iter-CoT advances robust, context-rich in-context learning for complex reasoning tasks.

Abstract

Large language models (LLMs) can achieve highly effective performance on various reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting as demonstrations. However, the reasoning chains of demonstrations generated by LLMs are prone to errors, which can subsequently lead to incorrect reasoning during inference. Furthermore, inappropriate exemplars (overly simplistic or overly complex) can hurt overall performance across varying levels of difficulty. We introduce Iter-CoT (Iterative bootstrapping in Chain-of-Thoughts Prompting), an iterative bootstrapping approach for selecting exemplars and generating reasoning chains. By utilizing iterative bootstrapping, our approach enables LLMs to autonomously rectify errors, resulting in more precise and comprehensive reasoning chains. Simultaneously, our approach selects challenging yet answerable questions, accompanied by reasoning chains, as exemplars with a moderate level of difficulty, which enhances the LLMs' generalizability across varying levels of difficulty. Experimental results indicate that Iter-CoT achieves competitive performance across three distinct reasoning tasks on ten datasets.

Paper Structure

This paper contains 43 sections, 10 figures, 22 tables.

Figures (10)

  • Figure 1: Effect of different demonstrations (Simple-CoT vs. Complex-CoT) on questions of different difficulty (from 2-hop to 9-hop) on the GSM8K dataset.
  • Figure 2: Impact of wrong exemplars on three different benchmarks (GSM8K, CSQA and Letter).
  • Figure 3: Effect of re-answering the question based on the hint and previous rationales.
  • Figure 4: The illustration of the value of revised examples. Challenging yet answerable exemplars as demonstrations can enhance the model's reasoning performance.
  • Figure 5: The workflow of Iter-CoT. (1) Construction of the demonstration pool: (a) Initialization: we query the LLM to generate a reasoning chain and an answer with Zero-Shot-CoT. (b) Bootstrapping: we use Revise-Prompt to guide the LLM to revise the reasoning chain repeatedly until the generated CoT is correct. (c) Summarization: we prompt the LLM with Summary-Prompt to generate the final reasoning chain (referred to as the Final CoT) based on the contextual information from the whole process. We then add each Final CoT whose answer is correct, together with its question, to the demonstration pool as an exemplar. (2) Inference: the LLM answers the test questions using demonstrations sampled from the constructed demonstration pool.
  • ...and 5 more figures
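
The workflow in the Figure 5 caption can be illustrated with a minimal Python sketch of the labeled setting. It is not the authors' implementation. The `llm` callable, the wording of `REVISE_PROMPT` and `SUMMARY_PROMPT`, the `max_rounds` cap, and the `require_revision` filter (one possible reading of "challenging yet answerable") are all assumptions made for illustration. Only the three stages and the final sampling step come from the caption.

```python
import random
from typing import Callable, List, Optional, Tuple

# Zero-Shot-CoT trigger phrase. The two prompts below are hypothetical wordings:
# the source names Revise-Prompt and Summary-Prompt but does not give their text.
ZERO_SHOT_COT = "Let's think step by step."
REVISE_PROMPT = "Your answer is not correct. Please revise your reasoning and answer again."
SUMMARY_PROMPT = "Summarize the discussion above into one complete reasoning chain and a final answer."

# Hypothetical model interface: prompt text -> (reasoning chain, answer).
LLM = Callable[[str], Tuple[str, str]]


def build_pool(
    questions: List[str],
    labels: List[str],
    llm: LLM,
    max_rounds: int = 3,            # assumed cap on revision rounds
    require_revision: bool = True,  # assumed filter: keep only initially-wrong questions
) -> List[Tuple[str, str]]:
    """Build a demonstration pool of (question, Final CoT) pairs."""
    pool = []
    for question, gold in zip(questions, labels):
        # 1) Initialization: first attempt with Zero-Shot-CoT.
        context = f"Q: {question}\nA: {ZERO_SHOT_COT}"
        chain, answer = llm(context)
        context += f" {chain}"

        # 2) Bootstrapping: ask for revisions until the answer matches the label.
        rounds = 0
        while answer != gold and rounds < max_rounds:
            context += f"\n{REVISE_PROMPT}\n"
            chain, answer = llm(context)
            context += chain
            rounds += 1

        if answer != gold:
            continue  # never answered correctly within the cap
        if require_revision and rounds == 0:
            continue  # solved on the first try; treated here as too easy

        # 3) Summarization: condense the whole exchange into a Final CoT.
        final_chain, final_answer = llm(f"{context}\n{SUMMARY_PROMPT}\n")
        if final_answer == gold:
            pool.append((question, final_chain))
    return pool


def build_prompt(
    pool: List[Tuple[str, str]],
    test_question: str,
    k: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Inference: sample up to k demonstrations and append the test question."""
    rng = rng or random.Random(0)
    demos = rng.sample(pool, min(k, len(pool)))
    parts = [f"Q: {q}\nA: {chain}" for q, chain in demos]
    parts.append(f"Q: {test_question}\nA:")
    return "\n\n".join(parts)
```

In the unlabeled setting mentioned in the TL;DR, the comparison against `gold` would presumably be replaced by a judgment from an evaluator model. That variant is not sketched here.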