Tree-of-Code: A Hybrid Approach for Robust Complex Task Planning and Execution
Ziyi Ni, Yifan Li, Daxiang Dong
TL;DR
This work addresses the instability and hallucinations of LLM-driven task planning that relies on stepwise code generation (CodeAct). It introduces Tree-of-Code (ToC), a hybrid framework that unifies tree-structured thought exploration with end-to-end code generation and execution, using a BFS-like expansion of code-execution nodes and majority voting to select robust outcomes. By formalizing thought-to-code transformation with an llm-function and treating code as reasoning, ToC improves accuracy and reduces the number of interaction steps on a complex multi-scene benchmark (M3ToolEval) without requiring model fine-tuning. The results demonstrate stronger robustness on complex reasoning tasks and highlight the practical potential of integrating diverse LLMs through structured reflection and execution feedback. Future work targets few-shot supervised fine-tuning and real-world deployment.
Abstract
The exceptional capabilities of large language models (LLMs) have accelerated the rise and widespread adoption of agents. Recent studies have shown that generating Python code to consolidate the actions of LLM-based agents into a unified action space (CodeAct) is a promising approach for building real-world LLM agents. However, this step-by-step code generation often lacks consistency and robustness, which leads to instability in agent applications, particularly on complex reasoning and out-of-domain tasks. In this paper, we propose a novel approach called Tree-of-Code (ToC) to tackle the challenges of complex problem planning and execution with an end-to-end mechanism. ToC integrates key ideas from Tree-of-Thought and CodeAct, combining their strengths to enhance solution exploration. In our framework, each final code execution result is treated as a node in the decision tree, and a breadth-first search strategy is used to explore potential solutions. The final outcome is determined by a voting mechanism over the outputs of the nodes.
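
The node-expansion and voting loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `generate` callable, the convention that a program stores its answer in a variable named `result`, the rule that only failed nodes are expanded, and the `max_depth` limit are all assumptions made for illustration. A deterministic stub stands in for the LLM.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (task, feedback from a failed parent or None) -> candidate programs.
Generator = Callable[[str, Optional[str]], List[str]]


def run_program(code: str) -> Tuple[bool, str]:
    """Execute one complete program and report (succeeded, output or error text).

    Assumption: a program leaves its final answer in a variable named `result`.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # end-to-end execution of the whole program
        return True, repr(namespace["result"])
    except Exception as exc:  # any failure becomes feedback for the next layer
        return False, f"{type(exc).__name__}: {exc}"


def tree_of_code(task: str, generate: Generator, max_depth: int = 3) -> Optional[str]:
    """Expand code-execution nodes layer by layer, then vote over successful outputs."""
    frontier: List[Optional[str]] = [None]  # the root has no feedback yet
    outputs: List[str] = []
    for _ in range(max_depth):
        next_frontier: List[Optional[str]] = []
        for feedback in frontier:
            for code in generate(task, feedback):
                ok, out = run_program(code)
                if ok:
                    outputs.append(out)        # successful node: keep its output for voting
                else:
                    next_frontier.append(out)  # failed node: expand it in the next layer
        if not next_frontier:
            break
        frontier = next_frontier
    if not outputs:
        return None
    # Majority vote: the most frequent output among successful nodes wins.
    return Counter(outputs).most_common(1)[0][0]


def stub_generate(task: str, feedback: Optional[str]) -> List[str]:
    """Deterministic stand-in for an LLM, used only to exercise the loop."""
    if feedback is None:
        return ["result = 6 * 7", "result = 6 * 7 + undefined_name", "result = 41"]
    return ["result = 42"]  # a "repaired" program produced after seeing the error


answer = tree_of_code("multiply 6 by 7", stub_generate)
print(answer)
```

In this toy run, two of the three first-layer programs succeed and one fails. The failed node is expanded once, its repaired child agrees with one of the earlier outputs, and that output wins the vote. In practice the generated code would need to run in a sandbox rather than through a bare `exec`.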
