Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models

Chengyu Du, Jinyi Han, Yizhou Ying, Aili Chen, Qianyu He, Haokun Zhao, Sirui Xia, Haoran Guo, Jiaqing Liang, Zulong Chen, Liangyue Li, Yanghua Xiao

TL;DR

Experimental results show that PTR significantly enhances LLM performance across ten diverse tasks, and in more open-ended tasks, LLMs also demonstrate substantial improvements in the quality of responses beyond mere accuracy, suggesting that PTR truly teaches LLMs to self-improve over time.

Abstract

Recent advancements in large language models (LLMs) have demonstrated that progressive refinement, rather than providing a single answer, results in more accurate and thoughtful outputs. However, existing methods often rely heavily on supervision signals to evaluate previous responses, making it difficult to assess output quality in more open-ended scenarios effectively. Additionally, these methods are typically designed for specific tasks, which limits their generalization to new domains. To address these limitations, we propose Progressive Thought Refinement (PTR), a framework that enables LLMs to refine their responses progressively. PTR operates in two phases: (1) Thought data construction stage: We propose a weak and strong model collaborative selection strategy to build a high-quality progressive refinement dataset to ensure logical consistency from thought to answers, and the answers are gradually refined in each round. (2) Thought-Mask Fine-Tuning Phase: We design a training structure to mask the "thought" and adjust loss weights to encourage LLMs to refine prior thought, teaching them to implicitly understand "how to improve" rather than "what is correct." Experimental results show that PTR significantly enhances LLM performance across ten diverse tasks (avg. from 49.6% to 53.5%) without task-specific fine-tuning. Notably, in more open-ended tasks, LLMs also demonstrate substantial improvements in the quality of responses beyond mere accuracy, suggesting that PTR truly teaches LLMs to self-improve over time.
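One way to picture the thought-mask idea is as a weighted token-level loss in which tokens belonging to the earlier "thought" contribute nothing, so the gradient comes only from the refined answer. The following is a minimal sketch under stated assumptions, not the paper's implementation: the segment labels, the weight values, and the function name `thought_masked_loss` are hypothetical, and the abstract does not specify the actual loss weights.

```python
from typing import Dict, Optional, Sequence


def thought_masked_loss(
    token_nll: Sequence[float],
    segments: Sequence[str],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    token_nll: one loss value per token (assumed to be computed elsewhere).
    segments:  a label per token, e.g. "query", "thought", "answer"
               (hypothetical labels, chosen for illustration).
    weights:   loss weight per segment label; any label not listed
               gets weight 0.0, i.e. it is masked out of the loss.
    """
    if weights is None:
        # Assumption: only the refined answer is supervised.
        weights = {"answer": 1.0}
    if len(token_nll) != len(segments):
        raise ValueError("token_nll and segments must have the same length")

    total = 0.0
    weight_sum = 0.0
    for nll, segment in zip(token_nll, segments):
        w = weights.get(segment, 0.0)  # masked segments contribute nothing
        total += w * nll
        weight_sum += w
    # Normalise by the total weight so masked tokens do not dilute the loss.
    return total / weight_sum if weight_sum > 0 else 0.0


example_nll = [2.0, 1.0, 3.0, 0.5, 1.5]
example_segments = ["query", "thought", "thought", "answer", "answer"]
example_loss = thought_masked_loss(example_nll, example_segments)
print(example_loss)  # 1.0: mean of the two answer-token losses
```

Passing a different `weights` dictionary (for example, a small non-zero weight on another segment) is how loss weights could be adjusted in this sketch.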

Paper Structure

This paper contains 38 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of our approach. (A) Pipeline of our progressive refinement dataset construction. We first prepare queries from general open-domain datasets and preprocess them in three steps. We then use a strong-weak model collaborative selection strategy to generate thoughts and answers for each query. We also apply In-Context Learning (ICL) and consistency filtering to ensure the quality of the thought process. (B) Illustration of weighted thought-mask fine-tuning, which aims to train the model to produce a better response in the next attempt and to ensure logical consistency during the thought process. Unlike IFT, our method uses thought-mask techniques to prompt the model to generate better responses. (C) Pipeline of PTR. Given a query $Q$, the LLM thinks progressively and refines its responses based on its own previous thought and a refinement instruction. The LLM corrects its mistakes on the second attempt and gives a more thoughtful answer in a later iteration.
  • Figure 2: A code example showing that PTR can refine beyond correction. PTR goes through three rounds, providing a higher-quality response in each iteration. In the first iteration, the model returns a simple output. In the second iteration, the model adds more detail, such as handling the empty list. In the third iteration, the model structures the code and further adds type checking and error messages.
  • Figure 3: Plot A: Multi-line plot showing the performance trends for multiple tasks, along with average performance and variance. Plot B: Bar plot comparing initial, base, and final performance for each task. Plot C: Box plot displaying the performance distribution across tasks. Plot D: Heat map representing task performance across training steps.
  • Figure 4: Performance of PTR over ten iterations across different tasks. The left plots show accuracy improvements in mathematical reasoning (GSM8K and MATH), reasoning tasks (ARC, GPQA, Winogrande, CommonsenseQA), comprehension tasks (MMLU, DROP, XSum), and coding tasks (HumanEval). More details are in the iteration-study appendix. Baseline performance is indicated by dashed lines. The right plots show performance over six iterations with radar charts, illustrating improvement over the first three iterations.
  • Figure 5: Performance of a model under different temperature settings during inference.
  • ...and 2 more figures
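
The inference pipeline described in the Figure 1 (C) caption, where the model refines its response using its own previous thought plus a refinement instruction, can be sketched as a simple loop. This is an illustrative sketch only: the `generate` callable stands in for any LLM call, and the prompt layout, the wording of `REFINE_INSTRUCTION`, and the default of three rounds are assumptions rather than details taken from the paper.

```python
from typing import Callable, List

# Hypothetical wording; the source does not give the actual instruction text.
REFINE_INSTRUCTION = "Please refine your previous answer."


def progressive_refine(
    query: str,
    generate: Callable[[str], str],
    rounds: int = 3,
) -> List[str]:
    """Run `rounds` attempts, feeding each response back as the prior thought.

    Returns every response in order, so the last element is the final answer.
    """
    responses: List[str] = []
    prompt = query  # the first attempt sees only the query
    for _ in range(rounds):
        response = generate(prompt)
        responses.append(response)
        # Later attempts see the query, the previous thought, and the
        # refinement instruction (assumed prompt layout).
        prompt = (
            f"{query}\n\nPrevious thought:\n{response}\n\n{REFINE_INSTRUCTION}"
        )
    return responses


# Usage with a stub model that records the prompts it receives.
seen_prompts: List[str] = []


def stub_generate(prompt: str) -> str:
    seen_prompts.append(prompt)
    return f"draft {len(seen_prompts)}"


drafts = progressive_refine("Sum a list of numbers.", stub_generate)
print(drafts)
```

No external evaluator appears in the loop: each round depends only on the model's own previous output, which matches the caption's description of refinement without a supervision signal at inference time.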