Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yunjie Ji, Yiping Peng, Han Zhao, Xiangang Li
TL;DR
The paper addresses how to improve LLM reasoning without retraining. It introduces Multi-round Thinking, a test-time scaling method that iteratively refines answers using only the previous round's final answer as additional input. The approach is simple: after an initial response, each subsequent round re-poses the original prompt together with the prior answer, discarding the intermediate reasoning trace to reduce cognitive inertia. Results for DeepSeek-R1, QwQ-32B, and AM-32B Distill on AIME 2024, MATH-500, GPQA-Diamond, and LiveCodeBench show consistent gains from Round 1 to Round 2, and the analysis finds that later rounds hedge less and produce shorter responses. A preliminary SFT attempt did not yield immediate improvements, which points to data-driven iterative reasoning as a direction for future research. Overall, Multi-round Thinking offers a practical, training-free way to boost LLM reasoning across diverse tasks, at the cost of additional wait time during inference.
Abstract
Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limitations in handling long texts and in reinforcement learning (RL) training efficiency. To address these issues, we propose a simple yet effective test-time scaling approach, Multi-round Thinking. This method iteratively refines model reasoning by leveraging previous answers as prompts for subsequent rounds. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, consistently show performance improvements on various benchmarks such as AIME 2024, MATH-500, GPQA-Diamond, and LiveCodeBench. For instance, the accuracy of QwQ-32B improved from 80.3% (Round 1) to 82.1% (Round 2) on the AIME 2024 dataset, while DeepSeek-R1 showed a similar increase from 79.7% to 82.0%. These results confirm that Multi-round Thinking is a broadly applicable, straightforward approach to achieving stable enhancements in model performance, underscoring its potential for future developments in test-time scaling techniques. The key prompt is: {Original question prompt} The assistant's previous answer is: <answer> {last round answer} </answer>, and please re-answer.
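The round-by-round procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` is a hypothetical callable standing in for whatever model API is used, `strip_reasoning` assumes the reasoning trace is wrapped in `<think>...</think>` tags, and the newline between the question and the rest of the prompt is an assumption about the template's layout.

```python
import re
from typing import Callable, List

# Prompt template taken from the abstract; the "\n" separator is an assumption.
PROMPT_TEMPLATE = (
    "{question}\n"
    "The assistant's previous answer is: <answer> {previous_answer} </answer>, "
    "and please re-answer."
)


def strip_reasoning(response: str) -> str:
    """Keep only the final answer.

    Assumes (hypothetically) that the reasoning trace is enclosed in
    <think>...</think> tags; everything outside those tags is the answer.
    """
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()


def multi_round_thinking(
    question: str,
    generate: Callable[[str], str],  # hypothetical model call: prompt -> response
    rounds: int = 2,
) -> List[str]:
    """Return the final answer produced in each round.

    Round 1 sees only the question. Every later round sees the question plus
    the previous round's final answer, never the earlier reasoning trace.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")

    answers: List[str] = []
    prompt = question
    for _ in range(rounds):
        answer = strip_reasoning(generate(prompt))
        answers.append(answer)
        # Build the next prompt from the original question and this answer only.
        prompt = PROMPT_TEMPLATE.format(question=question, previous_answer=answer)
    return answers
```

Each round costs one additional full generation, which is where the extra inference wait time comes from. Returning every round's answer, rather than only the last, makes it easy to compare Round 1 and Round 2 accuracy on a benchmark.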
