
AIME: AI System Optimization via Multiple LLM Evaluators

Bhrij Patel, Souradip Chakraborty, Wesley A. Suttle, Mengdi Wang, Amrit Singh Bedi, Dinesh Manocha

TL;DR

The paper tackles the limitation of single-LLM evaluations in text-based AI system optimization for complex tasks like code generation. It introduces AIME, a multi-evaluator protocol where independent role-specific evaluators generate separate assessments that are concatenated to guide iteration, supported by a theory linking evaluator count to reduced suboptimality via a linear-additivity framework. Empirically, AIME yields substantial gains over single-evaluator approaches in error detection (EDR up to 62%) and in task performance (SR/CR improvements up to ~13–18%) on LeetCodeHard and HumanEval, with ablations showing the importance of evaluator diversity and role selection. The work also demonstrates robustness to adversarial evaluations and discusses practical design considerations and limitations, outlining directions for extending to other tasks and more complex AI systems.

Abstract

Text-based AI system optimization typically involves a feedback loop scheme where a single LLM generates an evaluation in natural language of the current output to improve the next iteration's output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combines them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to $62\%$ higher error detection rate and up to $16\%$ higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial, as it can impact success rate by up to $12\%$.
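
The evaluation step described in the abstract (independent per-criterion evaluations combined by concatenation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `make_stub_evaluator` and `aime_evaluate` are hypothetical names, the evaluators are stubs standing in for LLM calls, and the three roles are the ones named in the figure captions.

```python
from typing import Callable, Dict

# Hypothetical type: an evaluator maps (task, code) to a natural-language evaluation.
Evaluator = Callable[[str, str], str]


def make_stub_evaluator(role: str) -> Evaluator:
    """Build a stand-in evaluator for one criterion (assumption: a real system calls an LLM here)."""
    def evaluate(task: str, code: str) -> str:
        n_lines = len(code.splitlines())
        return f"[{role}] evaluation of {n_lines}-line solution to: {task}"
    return evaluate


def aime_evaluate(task: str, code: str, evaluators: Dict[str, Evaluator]) -> str:
    """Run every evaluator independently and concatenate their outputs."""
    # Each evaluator receives only the task and the code, never another
    # evaluator's output, so the evaluations stay independent.
    evaluations = [evaluator(task, code) for evaluator in evaluators.values()]
    return "\n".join(evaluations)


roles = ["correctness", "logic", "readability"]
evaluators = {role: make_stub_evaluator(role) for role in roles}
combined = aime_evaluate("two-sum", "def f(a, b):\n    return a + b", evaluators)
print(combined)
```

In a feedback loop, the concatenated string would play the role of the single evaluation that guides the next iteration; a single-evaluator baseline corresponds to passing one evaluator that covers all criteria at once.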

Paper Structure (19 sections, 1 theorem, 7 equations, 10 figures, 4 tables, 1 algorithm)


Key Result

Theorem 1

Let $d_{\text{TV}}$ denote the total variation distance between two distributions and let $\sum_{k=1}^K \alpha_k = 1$. Assuming all pairs $\pi_1, \pi_2 \in \Pi$ are independent of one another,

Figures (10)

  • Figure 1: AI System Optimization Pipeline and Increased Error Detection and Success Rate with AIME-based Evaluation: [LEFT] Text-based AI system optimization with the SoTA framework TextGrad (Yuksekgonul et al., 2024) using our multiple-LLM-evaluator approach AIME (orange) and with a single-evaluator approach (blue). [TOP RIGHT] The single-evaluator approach cannot detect an error in the generated code that fails all test cases. However, one of the evaluators of AIME could, because the logical evaluator was independent of the correctness evaluator. [BOTTOM RIGHT] AIME-based optimization achieves $\sim16\%$ higher success rate than a single-evaluator approach in code generation tasks.
  • Figure 2: Using the LeetCodeHard and HumanEval benchmarks, we compare evaluations generated from Single-Eval against those of AIME in terms of [LEFT] EDR and [RIGHT] RAE scores. AIME has a higher EDR score on both datasets, indicating it is less prone to letting errors go undetected. AIME has a higher resistance to an adversarial evaluator on LeetCodeHard and a comparable resistance on HumanEval, suggesting its robustness over Single-Eval.
  • Figure 3: Independent evaluator of AIME provides more thorough explanations: Example evaluations for readability generated by Single-Eval and AIME. Both evaluations are for the same coding task at the same iteration, which failed all test cases. Even though both Single-Eval and AIME judge the code readable and offer no criticisms, AIME's readability comment is more thorough. This result may be because it was generated independently from evaluations of other criteria. Without having to worry about other roles, the readability evaluator could focus its entire output on readability.
  • Figure 4: [BAR PLOT] Success Rate and Completion Rate and [LINE PLOT] Best Completion Rate over max number of iterations for [LEFT] LeetCodeHard and [RIGHT] HumanEval. Over $10$ iterations for each coding problem, AIME has the highest SR and CR over both datasets.
  • Figure 5: Increasing Number of Evaluators and Diversity Helps: [LEFT] Setting all the evaluators of AIME to the same role, correctness, and increasing the number of evaluators from $1 \to 3 \to 6$ increases EDR. This result shows that even if there is only one role, multiple independent evaluations can help catch errors. [RIGHT] With six evaluators, having six distinct roles yields better SR, CR, and EDR than having all of the evaluators share the same role, correctness.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1