ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments

Pedro Gimenes, Zeyu Cao, Jeffrey Wong, Yiren Zhao

TL;DR

ARIES reframes reasoning as autonomous exploration over thought graphs by modeling transformations as actions in a Markov decision process and using a policy LLM to plan steps while a reasoning LLM executes them. It eliminates task-specific static schedules, achieving improved accuracy on HumanEval and reduced inference cost across benchmarks. The work introduces a multi-agent framework where two LLMs collaborate to decompose, solve, refine, reduce, and aggregate subproblems, guided by in-context planning and action ensembles. Empirical results on HumanEval, sorting, and set-intersection demonstrate robust gains and reveal failure modes tied to model size and decomposition depth, highlighting practical considerations for scaling autonomous LLM reasoning.

Abstract

Recent research has shown that LLM performance on reasoning tasks can be enhanced by scaling test-time compute. One promising approach, particularly for decomposable problems, arranges intermediate solutions as a graph on which transformations are performed to explore the solution space. However, prior works rely on pre-determined, task-specific transformation schedules, which depend on a set of searched hyperparameters. In this work, we view thought graph transformations as actions in a Markov decision process, and implement policy agents to drive effective action policies for the underlying reasoning LLM agent. In particular, we investigate the ability of another LLM to act as a policy agent on thought graph environments and introduce ARIES, a multi-agent architecture for reasoning with LLMs. In ARIES, reasoning LLM agents solve decomposed subproblems, while policy LLM agents maintain visibility of the thought graph state and dynamically adapt the problem-solving strategy. Through extensive experiments, we observe that using off-the-shelf LLMs as policy agents with no supervised fine-tuning (SFT) can yield up to $29\%$ higher accuracy on HumanEval relative to static transformation schedules, while reducing inference costs by $35\%$ and avoiding any search requirements. We also conduct a thorough analysis of observed failure modes, highlighting that limitations on LLM size and the depth of problem decomposition pose challenges to scaling LLM-guided reasoning.
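The policy/reasoning split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `ThoughtGraph`, `run_episode`, `scripted_policy`, and `toy_reasoner` are hypothetical names, and both agents are replaced by deterministic stand-ins (a hand-written rule instead of a policy LLM, list operations instead of a reasoning LLM) so that the loop runs without any model. The action names follow the transformations listed in the TL;DR.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Node:
    content: List[int]  # the thought held by this node (here: a list of numbers)
    kind: str           # the action that produced it ("root" for the task itself)


@dataclass
class ThoughtGraph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # (parent, child)

    def add(self, content: List[int], kind: str, parents: Sequence[int] = ()) -> int:
        self.nodes.append(Node(content, kind))
        child = len(self.nodes) - 1
        self.edges.extend((p, child) for p in parents)
        return child


Policy = Callable[[ThoughtGraph], Tuple[str, List[int]]]
Reasoner = Callable[[str, List[List[int]]], List[List[int]]]


def scripted_policy(graph: ThoughtGraph) -> Tuple[str, List[int]]:
    """Stand-in for the policy LLM: choose an action and nodes from the graph state."""
    kinds = [n.kind for n in graph.nodes]
    if "aggregate" in kinds:
        return "stop", []
    if "decompose" not in kinds:
        return "decompose", [0]
    has_child = {parent for parent, _ in graph.edges}
    for node_id, node in enumerate(graph.nodes):
        if node.kind == "decompose" and node_id not in has_child:
            return "solve", [node_id]  # one unsolved subproblem at a time
    return "aggregate", [i for i, n in enumerate(graph.nodes) if n.kind == "solve"]


def toy_reasoner(action: str, inputs: List[List[int]]) -> List[List[int]]:
    """Stand-in for the reasoning LLM: execute the chosen transformation."""
    if action == "decompose":
        (xs,) = inputs
        mid = len(xs) // 2
        return [xs[:mid], xs[mid:]]
    if action == "solve":
        return [sorted(xs) for xs in inputs]
    if action == "aggregate":
        return [list(heapq.merge(*inputs))]
    raise ValueError(f"unsupported action: {action}")


def run_episode(task: List[int], policy: Policy, reasoner: Reasoner,
                max_steps: int = 10) -> Tuple[ThoughtGraph, List[str]]:
    """Alternate policy decisions and reasoning steps until the policy stops."""
    graph = ThoughtGraph()
    graph.add(task, kind="root")
    trace: List[str] = []
    for _ in range(max_steps):
        action, node_ids = policy(graph)
        if action == "stop":
            break
        outputs = reasoner(action, [graph.nodes[i].content for i in node_ids])
        for out in outputs:
            graph.add(out, kind=action, parents=node_ids)
        trace.append(action)
    return graph, trace
```

Calling `run_episode([5, 2, 9, 1], scripted_policy, toy_reasoner)` walks through decompose, solve, solve, aggregate and leaves the sorted list in the last node. In the setting the abstract describes, the scripted rule would be replaced by an LLM prompted with the graph state, which is what removes the need for a fixed schedule.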

Paper Structure

This paper contains 26 sections, 6 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: ARIES workflow in answering the HumanEval prompt: "Find the shortest palindrome that begins with a supplied string". The policy agent selects an action based on the thought graph state, which is executed by the reasoning agent. First, the split action generates a skeleton implementation calling yet-to-implement subfunctions, decomposing the problem. Then, the reasoning agent is instructed to generate a solution for each subfunction. Since one of the solutions doesn't pass its test cases, the reasoning agent is instructed to refine it based on execution feedback.
  • Figure 2: Multi-agent framework for reasoning over thought graphs. First, (1) the policy agent samples an action and a subset of nodes given a prompt including (i-ii) general instructions and (iii-iv) an overview of the exploration state. The sample is then (2) passed to the reasoning agent, which finally (3) updates the thought graph state.
  • Figure 3: Pareto frontiers in total query cost ($C_{s+i}$) and task error ($\mathcal{E}$) for set intersection tasks at various difficulty levels. The total cost is the number of queries expended at search and inference time. Llama-3.1-405B was used for the reasoning and policy agents. ARIES pushes the Pareto frontier forward in each task.
  • Figure 4: Mean error (y-axis) obtained in the sorting32 task over a sweep of ensemble sizes (x-axis). Llama-3.1-70B was used as the policy agent.
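For readers unfamiliar with the Pareto frontiers mentioned in Figure 3, the sketch below shows one common way to extract the non-dominated points from a set of (cost, error) pairs when both quantities are minimized. The function name `pareto_frontier` is hypothetical, and the numbers in the usage note are made up for illustration; they are not results from the paper.

```python
from typing import List, Tuple


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (cost, error) points that no other point beats on both axes.

    Both cost and error are minimized. Points are scanned in order of
    increasing cost, and a point is kept only if it strictly lowers the
    best error seen so far.
    """
    frontier: List[Tuple[float, float]] = []
    best_error = float("inf")
    for cost, error in sorted(points):
        if error < best_error:
            frontier.append((cost, error))
            best_error = error
    return frontier
```

With the illustrative input `[(10, 0.5), (20, 0.3), (15, 0.6), (30, 0.3), (40, 0.1)]`, the function keeps `(10, 0.5)`, `(20, 0.3)`, and `(40, 0.1)`: the other two points cost more than a kept point without reducing the error. A method "pushes the frontier forward" when its points sit below and to the left of the frontier formed by the baselines.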