AutoMix: Automatically Mixing Language Models

Pranjal Aggarwal; Aman Madaan; Ankit Anand; Srividya Pranavi Potharaju; Swaroop Mishra; Pei Zhou; Aditya Gupta; Dheeraj Rajagopal; Karthik Kappaganthu; Yiming Yang; Shyam Upadhyay; Manaal Faruqui; Mausam

TL;DR

AutoMix addresses the problem of using multiple black-box LLMs under a budget constraint. It combines three components: an initial answer from a small model, context-grounded few-shot self-verification, and one of two routing strategies (thresholding or a POMDP-based router). The key innovations are a verification signal framed as entailment and a principled router that handles noisy verifier outputs when routing queries across $N$ models with differing costs and capabilities. The approach yields consistent cost-performance gains across five datasets and five models, as measured by the Incremental Benefit per Cost (ibc) metric and its geometric interpretation, and the gains carry over to three-model settings. AutoMix reduces inference cost while maintaining or improving performance, and it is applicable to black-box LLM APIs and low-data regimes.

Abstract

Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present AutoMix, an approach that strategically routes queries to larger LMs based on the approximate correctness of outputs from a smaller LM. Central to AutoMix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP-based router that can effectively select an appropriately sized model based on answer confidence. Experiments across five language models and five challenging datasets show that AutoMix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
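The generate, verify, route loop described in the abstract can be sketched for the two-model case as follows. This is a minimal illustration, not the authors' implementation: it uses the simpler thresholding router rather than the POMDP-based one, and the names `small_lm`, `large_lm`, `verifier`, `threshold`, and `num_samples` are all hypothetical. Estimating confidence as the fraction of repeated verifier calls that accept the answer is likewise an assumption made for this sketch.

```python
from typing import Callable, Tuple

# Hypothetical signatures: an LM maps (context, question) to an answer;
# a verifier maps (context, question, answer) to True if it judges the answer correct.
LM = Callable[[str, str], str]
Verifier = Callable[[str, str, str], bool]


def automix_two_model(
    question: str,
    context: str,
    small_lm: LM,
    large_lm: LM,
    verifier: Verifier,
    threshold: float = 0.5,
    num_samples: int = 8,
) -> Tuple[str, str, float]:
    """Return (answer, model_used, verifier_confidence) for one query."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")

    # Step 1: generate with the small model.
    answer = small_lm(context, question)

    # Step 2: self-verify. Assumption: confidence is the fraction of
    # repeated verifier calls that accept the answer.
    votes = sum(verifier(context, question, answer) for _ in range(num_samples))
    confidence = votes / num_samples

    # Step 3: thresholding router. Keep the small model's answer when the
    # verifier is confident enough; otherwise pay for the large model.
    if confidence >= threshold:
        return answer, "small", confidence
    return large_lm(context, question), "large", confidence


if __name__ == "__main__":
    # Stub models standing in for real API calls.
    result = automix_two_model(
        question="When did he take it?",
        context="(passage text)",
        small_lm=lambda c, q: "small-model answer",
        large_lm=lambda c, q: "large-model answer",
        verifier=lambda c, q, a: False,  # verifier always rejects
    )
    print(result)
```

Raising `threshold` sends more queries to the large model (higher cost, typically higher quality); lowering it keeps more small-model answers. A fixed threshold trusts the verifier's signal as given, which is the limitation the POMDP-based router is meant to address.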

Paper Structure (47 sections, 6 equations, 24 figures, 5 tables, 1 algorithm)

Introduction
Background and Related Work
Problem Formulation
AutoMix
Self-Verification
Router
Thresholding
POMDP-based Router
Experiments
A Metric for Cost-Performance Efficiency Analysis
Incremental Benefit Per Cost (ibc)
Geometric Interpretation
Setup
Models and Cost Calculation
Datasets
...and 32 more sections

Figures (24)

Figure 1: Representative example of the two-model setup in AutoMix. Instead of relying only on a small model (SLM) with low performance or a large model (LLM) with high cost, AutoMix automatically mixes multiple black-box language models according to the user's desired cost-quality tradeoff. AutoMix works in three steps: (1) generation by a small model ($LM_1$), (2) self-verification of the generated answer, and (3) routing to a larger model ($LM_2$) based on the confidence assessment from self-verification. In an N-model setup, the process is repeated until the final answer is reported.
Figure 2: Verification Prompt. The verification process is framed as a natural language entailment task, where the model determines the validity of the model-generated answer with respect to the context and question. We use a generic few-shot prompt for all tasks.
Figure 3: Context-Grounded Self-Verification using llama2-13b in Action. The example shows the verifier, which uses the same model as the answer generator, identifying and rejecting an inaccurate answer ("He took it in 1990") by effectively leveraging the context.
Figure 4: Left: AutoMix algorithm. Right: Performance vs. Cost curve. The slope between SLM and LLM provides a reference for the Incremental Benefit per Cost (ibc) of methods that mix models. Methods with a steeper slope than this reference when plotted against SLM have a positive ibc (green region), whereas those below the reference have a negative ibc (red region).
Figure 5: Main Results: performance (y-axis) vs. cost (x-axis) for different methods, with mistral-7b as the small model and GPT-4 as the large model. The POMDP-based meta-verifier is consistently above the linear interpolation (random mixing) of SLM-LLM, signifying a higher incremental benefit per unit cost (ibc).
...and 19 more figures
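
The slope comparison described in the Figure 4 caption can be sketched as below. This is an illustrative reading of the caption, not code from the paper: each system is summarized as a hypothetical (performance, cost) point, a method's ibc is taken as its slope relative to the SLM point, and the SLM-to-LLM slope is the reference. The numbers in the demo are invented purely to exercise the functions.

```python
from typing import Tuple

Point = Tuple[float, float]  # (performance, cost); a hypothetical representation


def slope_from_slm(slm: Point, other: Point) -> float:
    """Performance gained per unit of extra cost, relative to the SLM."""
    perf_gain = other[0] - slm[0]
    extra_cost = other[1] - slm[1]
    if extra_cost == 0:
        raise ValueError("cost must differ from the SLM's cost")
    return perf_gain / extra_cost


def beats_reference(slm: Point, llm: Point, method: Point) -> bool:
    """True if the method's slope is steeper than the SLM-to-LLM reference slope."""
    return slope_from_slm(slm, method) > slope_from_slm(slm, llm)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    slm = (50.0, 1.0)
    llm = (70.0, 51.0)
    mixed = (62.0, 21.0)
    print("reference slope:", slope_from_slm(slm, llm))
    print("method slope:", slope_from_slm(slm, mixed))
    print("above reference:", beats_reference(slm, llm, mixed))
```

A method that lands exactly on the SLM-LLM line, as random mixing would in expectation, has a slope equal to the reference and is therefore not counted as beating it.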
