Coalitions of Large Language Models Increase the Robustness of AI Agents

Prattyush Mangal; Carol Mak; Theo Kanakis; Timothy Donovan; Dave Braines; Edward Pyzer-Knapp

TL;DR

This paper investigates whether a coalition of open-source pretrained LLMs, each specialised in one sub-task of an agentic workflow, can surpass single-model or fine-tuned approaches on tool-use tasks. The authors decompose the workflow into planning, slot filling, and response formation, and assign each sub-task to the model best suited to it, which improves robustness and cost efficiency. On the ToolAlpaca benchmarks, the coalition outperforms fine-tuned baselines and single-model configurations. The results show clear per-task specialisation (e.g., Mistral for planning, Mixtral for slot filling, Flan UL2 for JSON RAG) and evidence that smaller models can beat larger ones on specific tasks. The work suggests that multi-model coalitions offer practical benefits for deploying cost-effective, flexible AI agents, and it motivates future work on coalitions that include fine-tuned models for potential further gains.

Abstract

The emergence of Large Language Models (LLMs) has fundamentally altered the way we interact with digital systems and has led to the pursuit of LLM-powered AI agents to assist in daily workflows. LLMs, whilst powerful and capable of demonstrating some emergent properties, are not logical reasoners and often struggle to perform well at all of the sub-tasks an AI agent carries out to plan and execute a workflow. While existing studies tackle this lack of proficiency by generalised pretraining at a huge scale or by specialised fine-tuning for tool use, we assess whether a system comprising a coalition of pretrained LLMs, each exhibiting specialised performance at individual sub-tasks, can match the performance of single-model agents. The coalition-of-models approach showcases its potential for building robustness and reducing the operational costs of these AI agents by leveraging traits exhibited by specific models. Our findings demonstrate that the need for fine-tuning can be mitigated by considering a coalition of pretrained models, and we believe that this approach can be applied to other non-agentic systems which utilise LLMs.
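
The routing idea behind a coalition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Coalition` class, the `make_stub` helper, and the sub-task keys are hypothetical names, and the stub functions stand in for real inference calls. Only the pairing of models to sub-tasks (Mistral for planning, Mixtral for slot filling, Flan UL2 for JSON RAG) is taken from the summary above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A "model" here is any function from prompt text to completion text.
ModelFn = Callable[[str], str]


@dataclass
class Coalition:
    """Routes each sub-task of an agentic workflow to its assigned model.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    assignments: Dict[str, ModelFn]

    def run(self, sub_task: str, prompt: str) -> str:
        # Fail loudly if no model has been assigned to the sub-task.
        if sub_task not in self.assignments:
            raise KeyError(f"no model assigned to sub-task {sub_task!r}")
        return self.assignments[sub_task](prompt)


def make_stub(model_name: str) -> ModelFn:
    """Stand-in for a real inference call; it only tags the prompt."""
    def call(prompt: str) -> str:
        return f"[{model_name}] {prompt}"
    return call


# Sub-task to model pairing as reported in the summary; the model names
# are labels for the stubs, not identifiers of any real serving endpoint.
coalition = Coalition(assignments={
    "planning": make_stub("mistral"),
    "slot_filling": make_stub("mixtral"),
    "json_rag": make_stub("flan-ul2"),
})

if __name__ == "__main__":
    print(coalition.run("planning", "Which tools answer the user's query?"))
    print(coalition.run("slot_filling", "Fill the parameters for the chosen tool."))
```

Keeping the assignment in a plain dictionary means a model can be swapped for one sub-task without touching the rest of the workflow, which is the flexibility the coalition approach relies on.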

Paper Structure (17 sections, 13 figures, 2 tables)

Introduction
Results
Coalitions of pretrained models outperform fine-tuned models
Coalitions outperform using single models
Specific models are better at specific tasks
Model specialisation leads to accuracy improvements and cost savings
Conclusion
Experimental
Evaluating Planning
Evaluating Slot Filling (Parameter Inference)
Evaluating Procedural Accuracy
Evaluating System Responses
Evaluating JSON RAG and Critiquing
Data availability
Code availability
...and 2 more sections

Figures (13)

Figure 1: Timeline depicting major events in the utilisation of APIs.
Figure 2: Stages in a decomposed agentic workflow for API consumption as evaluated in this study.
Figure 3: An overview of the workflow to answer intents and queries by orchestrating calls to external tools. Sub-tasks involve planning tool usage, slot filling tool parameters and summarising the collected information to form a final natural language response to the initial query. Each sub-task is assigned to different LLMs to achieve more accurate workflow executions. Example tasks allocated to LLMs are identified by blue prompts and LLM responses are identified by purple messages.
Figure 4: Example test case from the ToolAlpaca test dataset requiring multiple finance domain APIs for task completion.
Figure 5: Assessing different models for JSON RAG. The models have been assessed against a custom dataset to determine which offers the best performance at the JSON RAG task, which is applied to filter long JSON responses from tool executions. The Flan UL2 20B model outperforms all other model choices at this task.
...and 8 more figures
