Auto-Evolve: Enhancing Large Language Model's Performance via Self-Reasoning Framework

Krishna Aswani; Huilin Lu; Pranav Patankar; Priya Dhalwani; Iris Tan; Jayant Ganeshmohan; Simon Lacasse

TL;DR

Auto-Evolve tackles the limitation of fixed seed reasoning modules in prompting by dynamically generating task-specific reasoning modules and iteratively refining a domain-adaptive reasoning structure encoded as JSON. The framework comprises three components (GENERATE, IMPLEMENT, REFINE) and a two-stage workflow that produces task-tailored instructions guiding LLMs without predefined seeds. Empirically, it achieves absolute gains of up to 10.4% over CoT and roughly 6–7% on average across Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT-4 on BBH, while reducing inference calls relative to ensemble methods. The results demonstrate enhanced reasoning flexibility, robustness across tasks, and transferability to open-source models, with potential for more interpretable, scalable LLM reasoning. They also highlight avenues for future work on feedback-driven refinement and bias mitigation.

Abstract

Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strategies rely on a single or fixed set of static seed reasoning modules, such as "think step by step" or "break down this problem", intended to simulate the human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we introduce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and a downstream action plan, resulting in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT-4, where it consistently outperforms the SOTA prompting strategies. Auto-Evolve outperforms CoT by up to 10.4%, and by 7% on average across these four models. Our framework introduces two innovations: a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with the human reasoning paradigm, thus eliminating the need for predefined templates. b) We introduce an iterative refinement component that incrementally refines instruction guidance for LLMs and boosts performance by 2.8% on average compared to doing it in a single step.
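
The control flow described above (generate task-specific modules, build an initial JSON structure, then refine it over several passes) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the prompt wording, the `steps` key, the `refine_rounds` parameter and the `stub_llm` stand-in are all assumptions made here so that the example runs without a real model.

```python
import json
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

def generate(llm: LLM, task_examples: list[str]) -> list[str]:
    """GENERATE: ask the model for task-specific reasoning modules, one per line."""
    prompt = "Propose reasoning modules for these task examples:\n" + "\n".join(task_examples)
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def implement(llm: LLM, modules: list[str]) -> dict:
    """IMPLEMENT: turn the modules into an initial JSON reasoning structure."""
    prompt = "Return a JSON reasoning structure using these modules:\n" + "\n".join(modules)
    return json.loads(llm(prompt))

def refine(llm: LLM, structure: dict) -> dict:
    """REFINE: ask the model to improve the current structure (one pass)."""
    prompt = "Improve this JSON reasoning structure and return JSON only:\n" + json.dumps(structure)
    return json.loads(llm(prompt))

def auto_evolve_stage1(llm: LLM, task_examples: list[str], refine_rounds: int = 2) -> dict:
    """Run GENERATE, then IMPLEMENT, then REFINE for an assumed number of rounds."""
    structure = implement(llm, generate(llm, task_examples))
    for _ in range(refine_rounds):
        structure = refine(llm, structure)
    return structure

def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only to make the sketch runnable."""
    if prompt.startswith("Propose"):
        return "identify entities\ncheck constraints"
    if prompt.startswith("Return"):
        return json.dumps({"steps": prompt.splitlines()[1:]})
    structure = json.loads(prompt.split("\n", 1)[1])
    structure["steps"].append("verify answer")
    return json.dumps(structure)

result = auto_evolve_stage1(stub_llm, ["Q: example task"], refine_rounds=2)
print(json.dumps(result, indent=2))
```

Swapping `stub_llm` for a call to an actual model would be the only change needed to try the loop on real tasks; the number of refinement rounds is left as a parameter because the text only says the structure is refined iteratively.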

Paper Structure (29 sections, 3 equations, 14 figures, 3 tables)


Introduction
Related work
Auto-Evolve Framework
Reasoning Module Generator (GENERATE)
Reasoning structure initializer (IMPLEMENT)
Reasoning structure evolver (REFINE)
Experiments
Datasets
Models
Baselines
Experiments setup and evaluation
Results and Discussion
Performance
Efficiency
Themes: Improvement across categories
...and 14 more sections

Figures (14)

Figure 1: Illustration of using the Auto-Evolve workflow for problem-solving.
Figure 2: Overview of the three components of Auto-Evolve Stage 1. The Reasoning Module Generator (GENERATE) produces a set of task-specific reasoning modules, and the Reasoning Structure Initializer (IMPLEMENT) builds a starting JSON reasoning structure. Over multiple runs of REFINE, the Reasoning Structure Evolver then refines the reasoning structure into a domain-adaptive actionable plan. For instance, when solving the reasoning QA task, the initial reasoning structure from IMPLEMENT may lack depth in 'moral, intentional, or counterfactual analysis'. The REFINE process addresses this gap by identifying and incorporating these additional elements, thus improving the structure's ability to solve the task.
Figure 3: Overview of the Auto-Evolve workflow in mathematical notation.
Figure 4: Task-level BBH performance of Auto-Evolve on Mistral Large relative to Direct Prompt, CoT, and Self-Discover. Results for the Claude models and GPT-4 are in the Appendix (figure labels fig:acc_diff_gpt and fig:acc_diff_Claude_2.0).
Figure 5: Performance of Auto-Evolve on Claude 2.0 in four task categories
...and 9 more figures
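
The Figure 2 caption's example, where REFINE adds missing 'moral, intentional, or counterfactual analysis' to the initial structure, can be pictured with a small data-structure sketch. In the framework the refinement is done by the LLM itself; the rule-based check below is only a stand-in chosen so the example is runnable, and the JSON keys, helper names and required-element list are hypothetical.

```python
import json

# Hypothetical initial structure, as IMPLEMENT might return it: step name -> slot to fill in.
initial_structure = {
    "Identify the event and the actors involved": "",
    "Determine the causal chain": "",
}

def missing_elements(structure: dict, required: list[str]) -> list[str]:
    """Return the required analysis types that no step name mentions yet."""
    step_names = " ".join(structure).lower()
    return [element for element in required if element.lower() not in step_names]

def refine_once(structure: dict, required: list[str]) -> dict:
    """One stand-in REFINE pass: append a step for each missing analysis type."""
    refined = dict(structure)
    for element in missing_elements(structure, required):
        refined[f"Perform {element} analysis"] = ""
    return refined

required = ["moral", "intentional", "counterfactual"]
refined_structure = refine_once(initial_structure, required)
print(json.dumps(refined_structure, indent=2))
```

Running a second pass on `refined_structure` leaves it unchanged, since every required element is already covered by a step name.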
