Optimizing Model Selection for Compound AI Systems
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica
TL;DR
This work tackles the model selection problem (MSP): choosing, from a pool of candidate LLMs, which model to assign to each module of a static compound AI system. It introduces LLMSelector, which iteratively picks a module and assigns it the model that an LLM diagnoser estimates to perform best on that module, relying on end-to-end monotonicity so that per-module improvements translate into global improvements with a bounded search. Under mild monotonicity assumptions, the approach converges to an optimal allocation. Experiments with real LLMs on self-refine, multi-agent-debate, and locate-solve systems show accuracy gains of 5%-70% over uniform-model baselines, with results competitive with or better than prompt-optimization baselines. The results underscore the practical importance of model selection for compound AI systems, particularly multi-stage pipelines that decompose complex tasks into sub-tasks, and the authors release open-source code and data to support further research.
Abstract
Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules.
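The iterative procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `select_models`, `module_score`, `end_to_end_score`, and `TOY_QUALITY` are hypothetical, the LLM diagnoser is replaced by a lookup in a made-up score table, and the end-to-end metric is a toy average chosen so that it is monotonic in each module's score. The sketch also adds an explicit end-to-end check before accepting a change, which is an assumption of this example.

```python
from typing import Callable, Dict, List

Allocation = Dict[str, str]  # module name -> model name


def select_models(
    modules: List[str],
    models: List[str],
    module_score: Callable[[str, str, Allocation], float],
    end_to_end_score: Callable[[Allocation], float],
    max_rounds: int = 10,
) -> Allocation:
    """Greedy per-module model selection (illustrative sketch only)."""
    # Start from a uniform allocation: every module uses the same model.
    allocation = {module: models[0] for module in modules}
    best = end_to_end_score(allocation)

    for _ in range(max_rounds):
        improved = False
        for module in modules:
            # Choose the model with the highest estimated module-wise
            # score, holding all other modules fixed.
            top_model = max(
                models, key=lambda k: module_score(module, k, allocation)
            )
            if top_model == allocation[module]:
                continue
            candidate = {**allocation, module: top_model}
            score = end_to_end_score(candidate)
            # Accept the change only if end-to-end quality improves.
            if score > best:
                allocation, best = candidate, score
                improved = True
        if not improved:
            break  # no further gain is possible
    return allocation


# Made-up per-module quality table standing in for the LLM diagnoser.
TOY_QUALITY = {
    "generator": {"model_a": 0.6, "model_b": 0.8, "model_c": 0.7},
    "critic": {"model_a": 0.9, "model_b": 0.5, "model_c": 0.6},
    "refiner": {"model_a": 0.4, "model_b": 0.5, "model_c": 0.9},
}


def toy_module_score(module: str, model: str, allocation: Allocation) -> float:
    # Hypothetical stand-in: a real diagnoser would query an LLM.
    return TOY_QUALITY[module][model]


def toy_end_to_end(allocation: Allocation) -> float:
    # Toy metric: the mean module score, monotonic in each module.
    return sum(TOY_QUALITY[m][k] for m, k in allocation.items()) / len(allocation)


result = select_models(
    list(TOY_QUALITY), ["model_a", "model_b", "model_c"],
    toy_module_score, toy_end_to_end,
)
```

In this toy setting each pass visits every module once, so the number of passes rather than the size of the full allocation space drives the cost; with three modules and three models the exhaustive space would already contain 27 allocations.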
