
Can Github issues be solved with Tree Of Thoughts?

Ricardo La Rosa, Corey Hulse, Bangdi Liu

TL;DR

The paper evaluates applying Tree of Thoughts (ToT) to the real-world problem of resolving GitHub issues within SWE-bench. It compares ToT to IO prompting and Retrieval Augmented Generation (RAG), finding that ToT alone does not yield the critical reasoning required for patch generation in a repository-wide context. The authors analyze causes of this shortfall, proposing directions such as deeper thought processes and agentic capabilities (e.g., tool use and code search) to improve performance. The work provides empirical insights and a roadmap for refining ToT-enabled approaches for practical software engineering tasks.

Abstract

While there have been extensive studies in code generation by large language models (LLMs), where benchmarks like HumanEval have been surpassed with an impressive 96.3% success rate, these benchmarks predominantly judge a model's performance on basic function-level code generation and lack the critical thinking and concept of scope required in real-world scenarios such as solving GitHub issues. This research introduces the application of the Tree of Thoughts (ToT) language model reasoning framework for enhancing the decision-making and problem-solving abilities of LLMs for this complex task. Compared to traditional input-output (IO) prompting and Retrieval Augmented Generation (RAG) techniques, ToT is designed to improve performance by facilitating a structured exploration of multiple reasoning trajectories and enabling self-assessment of potential solutions. We experimentally deploy ToT in tackling a GitHub issue contained within an instance of SWE-bench. However, our results reveal that the ToT framework alone is not enough to give LLMs the critical reasoning capabilities to outperform existing methods. In this paper we analyze the potential causes of these shortcomings and identify key areas for improvement such as deepening the thought process and introducing agentic capabilities. The insights of this research are aimed at informing future directions for refining the application of ToT and better harnessing the potential of LLMs in real-world problem-solving scenarios.
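
The "structured exploration of multiple reasoning trajectories" with "self-assessment of potential solutions" can be illustrated with a minimal sketch of a generic breadth-first Tree of Thoughts loop. This is not the authors' implementation. The function names (`tree_of_thoughts_bfs`, `toy_propose`, `toy_score`) are hypothetical, the proposer and scorer are toy stand-ins for LLM calls, and the parameters `k` (candidate thoughts per state) and `b` (states kept per level) are assumptions chosen for illustration.

```python
from typing import Callable, List


def tree_of_thoughts_bfs(
    root: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    depth: int,
    k: int,
    b: int,
) -> List[str]:
    """Expand each kept state into up to k thoughts, then keep the best b."""
    frontier = [root]
    for _ in range(depth):
        # Exploration: every surviving state proposes k follow-up thoughts.
        candidates = [child for state in frontier for child in propose(state, k)]
        if not candidates:
            break
        # Self-assessment: rank candidates by the evaluator and prune to b.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:b]
    return frontier


# Toy stand-ins for the LLM proposer and evaluator (assumptions, not real prompts).
def toy_propose(state: str, k: int) -> List[str]:
    # Each "thought" appends one digit to the partial solution.
    return [state + str(i) for i in range(k)]


def toy_score(state: str) -> float:
    # Higher digit sum counts as a better partial solution.
    return float(sum(int(c) for c in state))


best = tree_of_thoughts_bfs("", toy_propose, toy_score, depth=3, k=5, b=1)
print(best)
```

With `b = 1` the search keeps a single state per level, so it behaves like a greedy walk guided by the evaluator; larger `b` values retain alternative trajectories at the cost of more proposer and evaluator calls.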

Paper Structure

This paper contains 19 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: ToT setup with n = 5, k = 5 and b = 1
  • Figure 2: SWE-bench Evaluation (Jimenez et al., 2024)