Grade Score: Quantifying LLM Performance in Option Selection

Dmitri Iourovitski

TL;DR

This work introduces Grade Score, a metric that combines entropy-based bias quantification with mode-based selection stability to evaluate LLMs as judges of multiple-choice options. It formalizes the score via $\text{LLM\_Score} = H(X)/H_{\max}$, $\text{Choice\_Score} = m/N$, and $\text{Grade\_Score} = 2 \cdot (\text{LLM\_Score} \cdot \text{Choice\_Score})/(\text{LLM\_Score} + \text{Choice\_Score})$, the harmonic mean of the two component scores, and investigates prompt engineering and option sampling to mitigate order bias. Using Monte Carlo permutation trials on the Open Assistant (OASST) dataset, the study shows that adding a randomly sampled, unrelated option and using carefully designed prompts can improve fairness and reliability, and it reveals an emergent behavior in which instruction-following models adapt to bias-directed prompts. The results guide the design of robust, fair LLM judging systems and point to future work on expanding the prompts, the evaluation domains, and the scalability of the Grade Score framework.

Abstract

This study introduces the "Grade Score", a novel metric designed to evaluate the consistency and fairness of Large Language Models (LLMs) when used as multiple-choice judges, with respect to order bias and choice consistency. The Grade Score combines Entropy, which measures order bias, and Mode Frequency, which assesses choice stability, offering insight into LLMs' reliability and impartiality. The study explores techniques such as prompt engineering and option sampling strategies to optimize the Grade Score and demonstrates their effectiveness in improving LLMs' performance. Results show that performance varies across LLMs and prompts, and highlight the positive impact of including irrelevant options. The study also identifies an emergent behavior in instruction-following models: they adjust their selections when given instructions that target specific biases. The Grade Score facilitates comparisons between LLMs and encourages ongoing research on optimizing their decision-making processes, with potential implications for improving their reliability and fairness in various applications. All code is available on GitHub: https://github.com/IoDmitri/GradeLab
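The three formulas stated in the TL;DR can be illustrated with a minimal Python sketch. It is not taken from the GradeLab repository; the function names and the reading of the symbols are assumptions made here for illustration: $X$ is taken to be the distribution of selected positions across permuted trials, $H_{\max}$ is taken to be $\log k$ for $k$ options, $m$ is the count of the most frequently selected option, and $N$ is the number of trials.

```python
import math
from collections import Counter


def llm_score(selected_positions, num_options):
    """Normalized entropy H(X)/H_max of the positions the model picked.

    Assumption: X is the empirical distribution of selected positions
    and H_max = log(num_options).
    """
    if num_options < 2:
        raise ValueError("need at least two options to define H_max")
    n = len(selected_positions)
    counts = Counter(selected_positions)
    # Shannon entropy of the empirical position distribution.
    entropy = sum(-(c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(num_options)


def choice_score(selected_options):
    """Mode frequency m/N: share of trials that picked the most common option."""
    m = Counter(selected_options).most_common(1)[0][1]
    return m / len(selected_options)


def grade_score(llm, choice):
    """Harmonic mean of the two component scores (0 if both are 0)."""
    if llm + choice == 0:
        return 0.0
    return 2 * llm * choice / (llm + choice)


# Hypothetical trials: three options shown in permuted order, six trials.
# A consistent judge picks option "A" wherever it appears.
consistent_positions = [0, 1, 2, 0, 1, 2]
consistent_options = ["A"] * 6

# A perfectly position-biased judge always picks whatever is listed first.
biased_positions = [0] * 6
biased_options = ["A", "B", "C", "A", "B", "C"]

consistent = grade_score(llm_score(consistent_positions, 3),
                         choice_score(consistent_options))
biased = grade_score(llm_score(biased_positions, 3),
                     choice_score(biased_options))
```

Under these assumptions, the consistent judge scores close to 1 on both components, while the position-biased judge has zero entropy over positions, which drives the harmonic mean to 0 regardless of its Choice Score.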

Paper Structure (30 sections, 6 equations, 6 figures, 3 tables)

Introduction
Related Works
Order bias in Large Language Models
Existing order bias mitigation strategies and their limitations
Instruction-following and explicit thinking strategies
Experimentation Methodology
Dataset selection
Option randomization Algorithm
Monte Carlo trials under permutation
Grade Score Formulation
Evaluation Framework for LLM Selection Stability
LLM Score: Entropy as a Measure of Bias
Choice Score: Mode Frequency as a Measure of Stability
Grade Score: Combining LLM Score and Choice Score
Prompting for LLM Selection
...and 15 more sections

Figures (6)

Figure 1: Comparison of consistent and biased LLM choices. (Left) Input 1 is consistently selected across order permutations. (Right) A perfectly biased LLM always selects the first input.
Figure 2: Unrelated Output Sampling: An unrelated option is added to the option set from a randomly sampled example.
Figure 3: Prompt 1 is a simple prompt that asks a Large Language Model to first select an option and then provide an explanation.
Figure 4: Prompt 2 is designed to have an LLM first come up with an evaluation for each option, and then make a selection.
Figure 5: Prompt 3 explicitly instructs LLMs to avoid order bias, testing their instruction-following capabilities and whether such instructions can mitigate various forms of bias.
...and 1 more figure
