
Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing

Boyuan Chen, Mingzhi Zhu, Brendan Dolan-Gavitt, Muhammad Shafique, Siddharth Garg

TL;DR

The paper addresses the cost-accuracy tension in LLM-based code generation by introducing a cascaded, multi-model framework with self-testing that operates in a black-box setting. It escalates from smaller to larger models whenever a quality score derived from self-generated tests falls below a threshold, and it learns Pareto-optimal configurations from a validation set to suit varying budgets. Key contributions include a formalized cascading pipeline, a Pareto-based optimization of the $(k, l, \theta)$ parameters, and extensive experiments across multiple open-source model families and datasets, achieving cost reductions of up to $70\%$ while maintaining accuracy. The approach enables cost-efficient, adaptable code-completion servers, is compatible with various model families, and could be combined with speculative decoding in future work.

Abstract

The rapid advancement of large language models (LLMs) has significantly improved code completion tasks, yet the trade-off between accuracy and computational cost remains a critical challenge. While using larger models and incorporating inference-time self-testing algorithms can significantly improve output accuracy, they incur substantial computational expenses at the same time. Furthermore, servers in real-world scenarios usually have a dynamic preference on the cost-accuracy tradeoff, depending on the budget, bandwidth, the concurrent user volume, and users' sensitivity to wrong answers. In this work, we introduce a novel framework combining model cascading and inference-time self-feedback algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff in LLM-based code generation. Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions. As a blackbox inference-time method, it requires no access to internal model parameters. We further propose a threshold-based algorithm to determine when to deploy larger models and a heuristic to optimize the number of solutions, test cases, and test lines generated per model, based on budget constraints. Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case, across various model families and datasets, while maintaining or improving accuracy in natural language generation tasks compared to both random and optimal single-model self-testing schemes. To our knowledge, this is the first work to provide a series of choices for optimizing the cost-accuracy trade-off in LLM code generation with self-testing.
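
As an illustration of what "a series of choices" on the cost-accuracy tradeoff can look like in practice, the following is a minimal sketch of extracting a Pareto front from validation results and picking a configuration for a given budget. It is not the paper's heuristic, which is not detailed in this summary; the function names `pareto_front` and `pick_for_budget` and the sample numbers are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (cost, accuracy) measured on a validation set


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only points that no other point beats on both cost and accuracy."""
    front: List[Point] = []
    best_acc = float("-inf")
    # Sort by ascending cost; among equal costs, try the most accurate first.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than every cheaper point
            front.append((cost, acc))
            best_acc = acc
    return front


def pick_for_budget(front: List[Point], budget: float) -> Optional[Point]:
    """Return the most accurate front point whose cost fits the budget."""
    affordable = [p for p in front if p[0] <= budget]
    return max(affordable, key=lambda p: p[1]) if affordable else None


# Hypothetical validation results, one point per parameter combination.
results = [(1.0, 0.5), (2.0, 0.4), (2.0, 0.7), (3.0, 0.7), (4.0, 0.9)]
front = pareto_front(results)
print(front)                        # [(1.0, 0.5), (2.0, 0.7), (4.0, 0.9)]
print(pick_for_budget(front, 3.0))  # (2.0, 0.7)
```

A server could recompute the chosen point whenever its budget changes, without touching any model internals.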

Paper Structure (20 sections, 2 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Results of our model cascading scheme compared to the randomly selected single-model self-testing scheme at a given accuracy on three model families solving the HumanEval dataset. The curves are derived via PCHIP interpolation from test set results in different parameter combinations, each represented as a dot in the cost-accuracy plot. On average, our scheme saves 70%, 11%, and 17% of cost across all accuracy ranges on each model family. Detailed computation is provided in Section \ref{sec:compare_avg}.
  • Figure 2: Cost and pass@1 accuracy with greedy search of all models on each dataset. Cost is estimated as dollars per one million tokens ($/1M tokens), averaged over the three datasets. See details of the cost calculation in Section \ref{sec:cost_calculation}. The APPS dataset refers to the introductory questions in the test set only.
  • Figure 3: Venn diagram of WizardCoder-Python-V1.0 models (7B, 13B, 34B) on answering HumanEval prompts using greedy search. Out of 164 prompts, numbers indicate questions correctly answered by one, two, or all three models. Examples: 21 by 13B and 34B, 2 only by 7B, 75 by all. 33 questions were unsolved by any model.
  • Figure 4: An overview of our proposed model cascading solution with $n$ models. Models with higher indices are larger in size. Starting with model 1, we generate multiple code solutions and test cases. We score each solution-test pair and identify the best pair; if the score exceeds a threshold, we accept the solution and output it; otherwise, we move on to the next model and repeat the process. If no earlier model is accepted, we take the highest-scored output from the largest model.
  • Figure 5: Solving MBPP question 57 "Find Max Num" with the WizardCoder-Python-V1.0 family of three model sizes: 7B, 13B and 34B. The threshold parameter in this case is $\theta=0.5$. The question is first passed to the 7B model, which generates $k_1=3$ solutions and test cases, each test case including $l_1=2$ test lines. There are $3 \times 3 \times 2=18$ solution-test pairs in total, and none of them pass, which is below the threshold $0.5 \times 18 = 9$. We thus pass the question to the next model in line, 13B. It generates $k_2=1$ solutions and test cases, each test case including $l_2=2$ test lines. There are $1 \times 1 \times 2 = 2$ solution-test pairs in total, and 2 of them pass, which is above the threshold $0.5 \times 2 = 1$. We therefore take the solution and exit. The biggest model, with 34B parameters, is skipped.
  • ...and 3 more figures
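
The cascading procedure described in the captions of Figures 4 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Stage`, `run_cascade`, and the `passes` callback are hypothetical names, the acceptance rule follows the counting in Figure 5 (number of passing solution-test pairs compared against $\theta$ times the total), and the returned solution is assumed to be the one that passes the most test lines.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# (solutions, test cases); each test case is a list of test lines.
Generation = Tuple[List[str], List[List[str]]]


@dataclass
class Stage:
    name: str
    generate: Callable[[str], Generation]  # hypothetical black-box model call


def run_cascade(
    prompt: str,
    stages: Sequence[Stage],                # ordered from smallest to largest model
    passes: Callable[[str, str], bool],     # does this solution pass this test line?
    theta: float,
) -> Tuple[str, str]:
    """Return (stage name, chosen solution). Assumes each stage yields >= 1 solution."""
    if not stages:
        raise ValueError("at least one stage is required")
    for i, stage in enumerate(stages):
        solutions, tests = stage.generate(prompt)
        lines = [line for case in tests for line in case]
        # counts[j] = number of test lines passed by solution j
        counts = [sum(passes(sol, line) for line in lines) for sol in solutions]
        total = len(solutions) * len(lines)  # k * k * l pairs in Figure 5
        best = solutions[max(range(len(solutions)), key=counts.__getitem__)]
        is_last = i == len(stages) - 1
        # Accept if enough pairs pass; the largest model is always accepted.
        if is_last or sum(counts) > theta * total:
            return stage.name, best
    raise AssertionError("unreachable")


# Toy stand-ins mirroring Figure 5: 7B fails every pair, 13B passes both.
small = Stage("7B", lambda p: (["bad_a", "bad_b", "bad_c"], [["t1", "t2"]] * 3))
medium = Stage("13B", lambda p: (["good"], [["t1", "t2"]]))
large = Stage("34B", lambda p: (["good_large"], [["t1", "t2"]]))
toy_passes = lambda sol, line: sol.startswith("good")

print(run_cascade("Find Max Num", [small, medium, large], toy_passes, theta=0.5))
# ('13B', 'good')
```

In a real deployment `passes` would execute the candidate solution against the generated test line in a sandbox; here it is a stub so the sketch stays self-contained.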