Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Nearchos Potamitis, Akhil Arora

TL;DR

The paper investigates whether retrials without explicit feedback can match or exceed the performance of more complex, self-reflective prompting strategies for LLM reasoning under budget constraints. It introduces retrials as a simple mechanism that re-attempts problem solving after incorrect outputs without verbalized introspection, and systematically compares it against IO, CoT, ToT, and Reflexion across GPT-4o-mini and Llama-3.3-70B on benchmarks like Game of 24, HumanEval, and HotpotQA. Across tasks and models, the authors find that simpler approaches, particularly CoT, often deliver superior cost-efficiency, while some complex methods may only excel with larger budgets or favorable conditions. These results challenge the notion that more intricate reasoning frameworks are inherently better and highlight the importance of budget-aware design in LLM-based problem solving.

Abstract

Recent advancements in large language models (LLMs) have catalyzed the development of general-purpose autonomous agents, demonstrating remarkable performance in complex reasoning tasks across various domains. This surge has spurred the evolution of a plethora of prompt-based reasoning frameworks. A recent focus has been on iterative reasoning strategies that refine outputs through self-evaluation and verbalized feedback. However, these strategies require additional computational complexity to enable models to recognize and correct their mistakes, leading to a significant increase in their cost. In this work, we introduce the concept of "retrials without feedback", an embarrassingly simple yet powerful mechanism for enhancing reasoning frameworks by allowing LLMs to retry problem-solving attempts upon identifying incorrect answers. Unlike conventional iterative refinement methods, our method does not require explicit self-reflection or verbalized feedback, simplifying the refinement process. Our findings indicate that simpler retrial-based approaches often outperform more sophisticated reasoning frameworks, suggesting that the benefits of complex methods may not always justify their computational costs. By challenging the prevailing assumption that more intricate reasoning strategies inherently lead to better performance, our work offers new insights into how simpler, more efficient approaches can achieve optimal results. So, are retrials all you need?
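
The retrial mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `retry_without_feedback`, `solve`, `is_correct`, `budget`, and `max_attempts` are hypothetical, `solve` stands in for any prompting strategy (for example a single CoT call) that returns an answer together with its cost, and `is_correct` stands in for whatever task-specific check identifies an incorrect answer. The sketch assumes the budget is checked before each attempt, so the last attempt may overshoot it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RetrialResult:
    answer: Optional[str]  # last answer produced, or None if no attempt was made
    solved: bool           # True if the checker accepted an answer
    attempts: int          # number of attempts made
    cost: float            # total cost spent across all attempts


def retry_without_feedback(
    solve: Callable[[str], tuple[str, float]],  # hypothetical: returns (answer, cost)
    is_correct: Callable[[str], bool],          # hypothetical task-specific checker
    problem: str,
    budget: float,
    max_attempts: int = 100,                    # safety cap, an assumption of this sketch
) -> RetrialResult:
    """Re-attempt `problem` from scratch until it is solved or the budget is spent.

    Every attempt sees only the original problem. No record of earlier
    failures and no verbalized reflection is passed back to the solver.
    """
    answer: Optional[str] = None
    attempts, cost = 0, 0.0
    while cost < budget and attempts < max_attempts:
        answer, attempt_cost = solve(problem)
        attempts += 1
        cost += attempt_cost
        if is_correct(answer):
            return RetrialResult(answer, True, attempts, cost)
    return RetrialResult(answer, False, attempts, cost)


if __name__ == "__main__":
    # Toy stand-in for a sampled model: yields a different answer on each call.
    candidates = iter(["21", "23", "24"])
    result = retry_without_feedback(
        solve=lambda problem: (next(candidates), 1.0),
        is_correct=lambda answer: answer == "24",
        problem="toy problem",
        budget=5.0,
    )
    print(result)
```

The contrast with reflective methods lies in the loop body: a feedback-based strategy would append a critique of the failed attempt to the next prompt, which adds tokens and therefore cost, whereas here each retry costs the same as the first attempt.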

Paper Structure

This paper contains 11 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base model. Within the indicated budget, simpler methods outperform more complex ones while remaining cost-efficient.
  • Figure 2: Comparing the cost-quality trade-off of CoT and ToT across different temperature levels using GPT-4o-mini as the base model. For CoT, the success rate increases strictly with temperature; for ToT it also increases, though not strictly.
  • Figure 3: Comparing the cost-quality trade-off of CoT and ToT, using Llama-3.3-70B as the base model, across different temperature levels.
  • Figure 4: Comparing the cost-quality trade-off of IO, CoT, ToT, and Reflexion using Llama-3.3-70B as the base model. Within the indicated budget, simpler methods perform similarly to or better than more complex ones while remaining cost-efficient.
  • Figure 5: Comparing the sample-quality trade-off of IO, CoT, ToT, and Reflexion using GPT-4o-mini as the base model. Within the indicated budget, simpler methods outperform more complex ones, but they remain sample-efficient only on the HumanEval task.
  • ...and 1 more figure