Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Injae Na, Keonwoong Noh, Woohwan Jung

TL;DR

LLM-AT tackles the cost-performance trade-off of using multiple LLM tiers by introducing a training-free framework that automatically selects an initial tier and iteratively upgrades as needed. It combines a Starter to pick the starting tier, a Generator with prompt-based reasoning, and a Judge to validate outputs, augmented by a history-driven accuracy estimator that uses top-$k$ similar past queries and pseudo-labels to forecast tier performance without ground-truth labels. The accuracy estimator relies on a Bayesian-inspired prior and similarity weighting to adapt to question difficulty, enabling practical, low-overhead tier selection. Empirical results on MATH and MCQA show that LLM-AT achieves favorable accuracy-time and accuracy-cost trade-offs, outperforming single-model baselines and training-based routers, with demonstrated robustness to cold starts and performance reversals. The approach has practical impact for real-world LLM services by reducing operational costs while maintaining high-quality responses, and it can be extended to heterogeneous tier ecosystems and open-ended generation tasks in future work.
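The estimator idea summarized above can be illustrated with a minimal sketch. It is not the paper's formula: the cosine similarity, the smoothing rule, and the names `estimate_accuracy`, `prior_acc`, and `prior_strength` are all assumptions chosen for illustration. The `history` records (an embedding plus per-tier pseudo-labels) are a hypothetical layout.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def estimate_accuracy(query_vec, history, tier, k=3, prior_acc=0.5, prior_strength=1.0):
    """Similarity-weighted valid-response rate over the top-k most similar past queries.

    history: list of (embedding, {tier_name: pseudo_label_bool}) records (assumed layout).
    The prior acts like pseudo-counts, so an empty history returns prior_acc (cold start).
    """
    scored = [(cosine(query_vec, vec), labels[tier])
              for vec, labels in history if tier in labels]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    weights = [max(sim, 0.0) for sim, _ in top]  # ignore negative similarities
    hits = sum(w for w, (_, valid) in zip(weights, top) if valid)
    return (prior_strength * prior_acc + hits) / (prior_strength + sum(weights))
```

With no history the estimate falls back to the prior, and as similar records accumulate the pseudo-labels dominate. That is one simple way to get the cold-start behavior the summary mentions.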

Abstract

LLM providers typically offer multiple LLM tiers that vary in performance and price. As NLP tasks become more complex and modularized, selecting a suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address this problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across the top-$k$ similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
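The Starter, Generator, and Judge loop described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' implementation. The function names, the `threshold` rule for picking the starting tier, and the fallback of returning the top tier's last response when nothing passes the Judge are all assumptions. `generate` and `judge` are caller-supplied stand-ins for real model calls.

```python
def pick_start(tiers, est_acc, threshold):
    """Starter (assumed rule): lowest tier whose estimated accuracy clears the threshold."""
    for i, tier in enumerate(tiers):
        if est_acc.get(tier, 0.0) >= threshold:
            return i
    return len(tiers) - 1  # assumption: fall back to the top tier

def llm_at(question, tiers, generate, judge, est_acc, threshold=0.7):
    """Run the upgrade loop over tiers ordered from cheapest to most capable.

    generate(tier, question) -> response   (Generator, hypothetical callable)
    judge(question, response) -> bool      (Judge, hypothetical callable)
    Returns (tier_used, response).
    """
    start = pick_start(tiers, est_acc, threshold)
    response = None
    for tier in tiers[start:]:
        response = generate(tier, question)
        if judge(question, response):
            return tier, response  # valid response: stop upgrading
    # Assumption: if even the top tier is judged invalid, return its response anyway.
    return tiers[-1], response
```

Because the loop only moves upward from the starting tier, a good initial pick avoids paying for lower-tier calls that are likely to be rejected. This is the role the abstract assigns to the accuracy estimator.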

Paper Structure

This paper contains 33 sections, 4 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: An example of the LLM-AT process with an input question. 'S', 'G', and 'J' indicate the Starter, Generator, and Judge, respectively.
  • Figure 2: An overview of LLM-AT Framework.
  • Figure 3: Main results. The marker shapes represent the LLM used (▲: o1, ●: o1-mini, ■: GPT-4o, ▼: GPT-4o-mini). For LLM-AT (red), each marker indicates the top-tier model. LLM-AT(Oracle) denotes the results obtained using an oracle judge.
  • Figure 4: Distribution of the estimated accuracy.
  • Figure 5: Tiers selected by LLM-AT in MATH based on question difficulty.
  • ...and 4 more figures