Recursive Think-Answer Process for LLMs and VLMs

Byung-Kwan Lee, Youngchae Chee, Yong Man Ro

TL;DR

Analysis of the frequency of "Oops"-like expressions in model responses shows that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning.

Abstract

Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 15 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overall accuracy (%) of numerous large language models (LLMs) on five evaluation benchmarks: AIME25 [patel2024aime], HMMT Feb 25 [hmmt], OmniMath [gao2024omni], GPQA [du2025supergpqa], and LiveCodeBench [jain2024livecodebench].
  • Figure 2: Overall accuracy (%) of numerous vision-language models (VLMs) on five evaluation benchmarks: MMMU [yue2023mmmu], MathVista [lu2023mathvista], OlympiadBench [he2024olympiadbench], MathVision [wang2024measuring], and MMMU-Pro [yue2024mmmu].
  • Figure 3: Qualitative example of recursive think–answer process on a combinatorics question. The model iteratively refines its solution across multiple reasoning cycles, successfully correcting initial misconceptions such as off-by-one errors.
  • Figure 4: Recursive Think-Answer Process. Given a question $q$, the base LLM/VLM $\pi_{\theta}$ recursively generates multiple Think-Answers $o^{(t)}$ until the answer is correct at $t=M$. In this example, the effective recursion depth is $M=3$. A pre-trained Confidence Generator $\mathbb{C}_{\phi}$ assesses each question and Think-Answer pair $(q, o^{(t)})$ and generates a confidence score $\text{Conf}^{(t)}$. This confidence score is used to formulate the confidence-based rewards $R_{\text{Increase}}$ and $R_{\text{Final}}$, which serve as a sufficient reinforcement signal to train the model to recursively generate higher-confidence Think-Answers until its intrinsic confidence is high enough. Note that the full responses for this question are given in Appendix A.
  • Figure 5: Training curves showing the progression of three reward signals (the recursively confidence increase reward, the last answer's confidence reward, and the accuracy reward) over GRPO [shao2024deepseekmath] training iterations. All rewards show consistent upward trends, indicating effective recursive refinement.
  • ...and 2 more figures
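
The recursion described in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the stopping `threshold`, the `max_depth` cap, and in particular the concrete forms of the two rewards are hypothetical, since the exact definitions of $R_{\text{Increase}}$ and $R_{\text{Final}}$ are not given in this overview. The toy generator and confidence scorer stand in for $\pi_{\theta}$ and $\mathbb{C}_{\phi}$.

```python
from typing import Callable, List, Tuple


def recursive_think_answer(
    question: str,
    generate: Callable[[str, List[str]], str],   # stand-in for the base model pi_theta
    confidence: Callable[[str, str], float],     # stand-in for the confidence generator C_phi
    max_depth: int = 4,                          # hypothetical cap on recursion depth
    threshold: float = 0.9,                      # hypothetical "confident enough" cutoff
) -> Tuple[List[str], List[float]]:
    """Generate Think-Answers o^(1), o^(2), ... until confidence is high enough."""
    outputs: List[str] = []
    confs: List[float] = []
    for _ in range(max_depth):
        o_t = generate(question, outputs)        # condition on earlier attempts
        conf_t = confidence(question, o_t)       # Conf^(t) for the pair (q, o^(t))
        outputs.append(o_t)
        confs.append(conf_t)
        if conf_t >= threshold:
            break
    return outputs, confs


def increase_reward(confs: List[float]) -> float:
    """Assumed form of R_Increase: fraction of steps where confidence rose."""
    if len(confs) < 2:
        return 0.0
    rises = sum(1 for prev, cur in zip(confs, confs[1:]) if cur > prev)
    return rises / (len(confs) - 1)


def final_reward(confs: List[float]) -> float:
    """Assumed form of R_Final: confidence of the last Think-Answer."""
    return confs[-1]


# Toy stand-ins so the sketch runs without any model.
def toy_generate(question: str, history: List[str]) -> str:
    return f"attempt {len(history) + 1}"


def toy_confidence(question: str, output: str) -> float:
    scores = {"attempt 1": 0.3, "attempt 2": 0.6, "attempt 3": 0.95}
    return scores.get(output, 0.0)


outputs, confs = recursive_think_answer("toy question", toy_generate, toy_confidence)
print(outputs, confs, increase_reward(confs), final_reward(confs))
```

In this toy run the loop stops at the third attempt because its confidence exceeds the assumed threshold, which mirrors the effective recursion depth $M=3$ in the figure's example. During training, the caption indicates that the two confidence-based rewards are used as the reinforcement signal; how they are weighted against the accuracy reward is not specified here.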