Optimizing Model Selection for Compound AI Systems

Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica

TL;DR

This work tackles the model selection problem (MSP): optimally assigning LLMs from a candidate pool to the modules of a static compound AI system. It introduces LLMSelector, which iteratively assigns each module the model with the best module-wise performance, as judged by an LLM diagnoser, and leverages end-to-end monotonicity to ensure global improvements with bounded search. Under mild monotonicity assumptions, the approach converges to an optimal allocation. Experiments with real LLMs across self-refine, multi-agent-debate, and locate-solve show consistent improvements of 5–70% over uniform-model baselines, with results that match or exceed prompt-optimization baselines. The results underscore the practical importance of model selection for compound AI systems, and the open-source code and data are intended to promote further research on multi-stage AI pipelines that decompose complex tasks into sub-tasks.

Abstract

Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules.
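The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `select_models`, the `module_score` callback (standing in for the LLM-based estimate of module-wise performance), the `max_rounds` budget, and the toy score table are all assumptions introduced here.

```python
from typing import Callable, Dict, List

Allocation = Dict[str, str]  # module name -> model name

def select_models(
    modules: List[str],
    models: List[str],
    module_score: Callable[[str, str, Allocation], float],
    max_rounds: int,
) -> Allocation:
    """Repeatedly give each module the model with the best estimated
    module-wise score, stopping when nothing improves or the budget ends."""
    # Hypothetical starting point: use the first model everywhere.
    allocation = {module: models[0] for module in modules}
    for _ in range(max_rounds):
        changed = False
        for module in modules:
            current = module_score(module, allocation[module], allocation)
            best = max(models, key=lambda k: module_score(module, k, allocation))
            if module_score(module, best, allocation) > current:
                allocation[module] = best
                changed = True
        if not changed:  # no further gain is possible
            break
    return allocation

# Toy stand-in for the LLM estimate: a fixed, made-up score table.
TOY_SCORES = {
    ("locate", "model_a"): 0.9, ("locate", "model_b"): 0.4,
    ("solve", "model_a"): 0.3, ("solve", "model_b"): 0.8,
}

def toy_score(module: str, model: str, allocation: Allocation) -> float:
    return TOY_SCORES[(module, model)]

result = select_models(["locate", "solve"], ["model_a", "model_b"], toy_score, max_rounds=3)
print(result)  # {'locate': 'model_a', 'solve': 'model_b'}
```

In this toy setup each round costs a number of score estimates proportional to the number of modules times the number of candidate models, rather than enumerating every combination of models across modules.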

Paper Structure

This paper contains 43 sections, 1 theorem, 3 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.1

Suppose for each task $z$ in $\mathcal{D}_{Tr}$, the optimal allocation is unique. Then Algorithm 1 converges to the optimal allocation on the training data after $L$ iterations.
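To make the flavor of this result concrete, the toy check below compares a single per-module pass against exhaustive search. It is only a sketch under strong assumptions that are not taken from the paper: the end-to-end score is modeled as the product of made-up per-module qualities (so it is monotone in each module and has a unique optimum), and the per-module qualities are read from a table instead of being estimated by an LLM. The names `QUALITY`, `end_to_end`, `exhaustive`, and `per_module_pass` are hypothetical.

```python
from itertools import product
from math import prod

# Made-up per-module quality of each candidate model (assumption).
QUALITY = {
    "m1": {"a": 0.9, "b": 0.5, "c": 0.2},
    "m2": {"a": 0.3, "b": 0.8, "c": 0.6},
    "m3": {"a": 0.4, "b": 0.1, "c": 0.7},
}
MODULES = list(QUALITY)
MODELS = ["a", "b", "c"]

def end_to_end(allocation):
    # Assumed monotone end-to-end score: product of module qualities.
    return prod(QUALITY[m][allocation[m]] for m in MODULES)

def exhaustive():
    # Score every combination of models across modules.
    best = max(
        product(MODELS, repeat=len(MODULES)),
        key=lambda combo: end_to_end(dict(zip(MODULES, combo))),
    )
    return dict(zip(MODULES, best))

def per_module_pass():
    # One iteration per module: pick the model with the best module quality.
    allocation = {m: MODELS[0] for m in MODULES}
    for m in MODULES:
        allocation[m] = max(MODELS, key=lambda k: QUALITY[m][k])
    return allocation

greedy = per_module_pass()
optimal = exhaustive()
n_exhaustive = len(MODELS) ** len(MODULES)  # combinations scored by exhaustive search
n_linear = len(MODELS) * len(MODULES)       # module-wise comparisons in one pass

print(greedy == optimal, n_exhaustive, n_linear)  # True 27 9
```

With three modules and three models the gap is small, but the exhaustive count grows exponentially in the number of modules while the per-module pass grows linearly.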

Figures (7)

  • Figure 1: LLMSelector outperforms compound AI systems that always call the same LLM. Here we study three compound systems, namely, self-refine (on LiveCodeBench and GCH), multi-agent-debate (on SimpleQA and FEVER), and locate-solve (on TableArithmetic and TableBias). LLMSelector achieves 5%-70% accuracy gains over allocating any model alone by allocating different models to different modules in these compound systems.
  • Figure 2: Examples of static compound AI systems. (a) self-refine system. (b) multi-agent-debate system. The diamond and star represent the input and output modules, and the circles represent the LLM modules. Directed lines represent data flow, and we omit most query inputs for simplicity.
  • Figure 3: LLMSelector Workflow. LLMSelector takes as input a compound AI system, a pool of candidate LLMs, a training dataset consisting of question-answer pairs, and a training budget. Then LLMSelector iteratively nominates one module and allocates to it the model with the highest module-wise performance estimated by an LLM. This is repeated until the budget is reached or no performance gain is possible. Finally, LLMSelector returns an optimized model allocation.
  • Figure 4: A case study on the TableArithmetic dataset. (a) Overall performance. Any single LLM has low performance on either Module 1 (e.g., Claude 3.5) or Module 2 (e.g., Gemini 1.5 Pro), but not both. LLMSelector learns to use the best LLM for each module and thus achieves high performance on both modules and on the whole system. (b) An example. Claude 3.5 fails to answer the extracted task correctly, while Gemini 1.5 cannot extract the correct task. LLMSelector allocates them to different modules to obtain the correct answer, 49. (c) Optimizer's effect. (c1) LLMSelector reduces the cost of reaching the same accuracy as exhaustive search by 60%. (c2) Greedy search's accuracy is surprisingly low because it gets stuck in a locally optimal solution. (c3) The LLM diagnoser enables LLMSelector to escape the local optimum.
  • Figure 5: The architectures of the compound AI systems studied in the experiments. (a) locate-solve consisting of two modules. (b) self-refine using three modules. (c) multi-agent-debate that involves six modules in total.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 4.1
  • proof