Fleet of Agents: Coordinated Problem Solving with Large Language Models

Lars Klein, Nearchos Potamitis, Roland Aydin, Robert West, Caglar Gulcehre, Akhil Arora

TL;DR

FoA addresses the trade-off between solution quality and inference cost in LLM-based reasoning. It deploys a fleet of $n$ agents that autonomously explore a search space for $k$ steps, then applies genetic-type particle-filtering resampling, driven by a value function and a discount factor $\gamma$, to enable dynamic branching. This runtime framework reduces cost while maintaining or improving solution quality on sequential decision tasks, as demonstrated on Game of 24, Mini Crosswords, and WebShop with multiple base models, including GPT-4 and LLaMA variants. On average, FoA achieves a ~5% quality improvement at ~40% of the cost of prior SOTA methods, and a smaller model paired with FoA (LLaMA3.2-11B) can surpass a larger one (LLaMA3.2-90B). The approach is prompt-agnostic and tunable via $n$, $k$, and $\gamma$, offers predictable latency, and is publicly released to encourage further exploration of genetic particle filtering in AI agents.

Abstract

While numerous frameworks have been developed to enhance the reasoning abilities of large language models (LLMs), there is a scarcity of methods that effectively balance the trade-off between cost and quality. In this paper, we introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring the search space autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We conduct extensive experiments on three benchmark tasks, "Game of 24", "Mini Crosswords", and "WebShop", utilizing four different LLMs, "GPT-3.5", "GPT-4", "LLaMA3.2-11B", and "LLaMA3.2-90B". On average across all tasks and LLMs, FoA obtains a quality improvement of ~5% while requiring only ~40% of the cost of previous SOTA methods. Notably, our analyses reveal that (1) FoA achieves the best cost-quality trade-off among all benchmarked methods and (2) FoA + LLaMA3.2-11B surpasses the LLaMA3.2-90B model. FoA is publicly available at https://github.com/au-clan/FoA.
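
The explore-then-resample loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `fleet_search`, its parameters, and the toy number-line task are hypothetical names chosen for illustration. The LLM-driven agent step is replaced by a random walk, the heuristic value function by a distance-based score, and the discount-factor mechanism is omitted entirely.

```python
import random
from typing import Callable, List, TypeVar

S = TypeVar("S")  # state type; assumed immutable in this sketch


def fleet_search(
    init_state: S,
    step: Callable[[S, random.Random], S],   # hypothetical stand-in for one agent step
    value: Callable[[S], float],             # hypothetical heuristic value function
    n_agents: int = 5,
    k_steps: int = 2,
    rounds: int = 10,
    seed: int = 0,
) -> S:
    """Toy explore-then-resample loop; returns the best state seen."""
    rng = random.Random(seed)
    fleet: List[S] = [init_state] * n_agents
    best = init_state
    for _ in range(rounds):
        # Exploration: every agent advances independently for k_steps steps.
        for _ in range(k_steps):
            fleet = [step(state, rng) for state in fleet]
        # Remember the best state seen so far (including earlier rounds).
        best = max(fleet + [best], key=value)
        # Selection: resample the fleet with replacement, weighted by value,
        # so promising states are duplicated and weak ones tend to drop out.
        weights = [value(state) for state in fleet]
        if sum(weights) <= 0:
            weights = [1.0] * len(fleet)  # fall back to uniform resampling
        fleet = rng.choices(fleet, weights=weights, k=n_agents)
    return best


# Toy task (an assumption for illustration): walk on the integers toward 7.
def toy_step(x: int, rng: random.Random) -> int:
    return x + rng.choice([-1, 1])


def toy_value(x: int) -> float:
    return 1.0 / (1.0 + abs(x - 7))


if __name__ == "__main__":
    print(fleet_search(0, toy_step, toy_value))
```

In this sketch the fleet size stays fixed at `n_agents`, so the number of step calls is `n_agents * k_steps * rounds` regardless of how the search unfolds, which reflects the predictable-cost property the abstract emphasizes.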

Paper Structure (34 sections, 2 equations, 12 figures, 8 tables, 3 algorithms)

Figures (12)

  • Figure 1: Analyzing the trade-off between cost and quality of representative SOTA methods with GPT-3.5 on the Game of 24 task. FoA achieves the best cost-quality trade-off.
  • Figure 2: Comparison between SOTA tree-search-based reasoning methods (ToT, GoT, LATS, RAP) and our FoA framework. FoA offers precise control over the tree width ($n$ agents) and depth ($t$ steps), leading to predictable latency and cost. In contrast, tree-search methods expand the $c$ most promising states at each step, so they offer no such control and their search trees might grow exponentially.
  • Figure 3: Fleet-of-Agents (FoA) comprising $n=5$ agents that think autonomously for $k$ steps and are then resampled to focus the search on promising regions.
  • Figure 4: Comparing (Left) quality and (Right) cost of FoA with the second most efficacious method (labeled SOTA in the plot) on each benchmark task.
  • Figure 5: Evaluating the trade-off between (Left) model size and quality on the benchmarked tasks with Llama3.2-11B and 90B as base models, and (Right) cost and quality of representative SOTA methods with GPT-3.5 on Game of 24.
  • ...and 7 more figures