Grade Score: Quantifying LLM Performance in Option Selection
Dmitri Iourovitski
TL;DR
This work introduces Grade Score, a metric that fuses entropy-based bias quantification with mode-based selection stability to evaluate LLMs as judges of multiple-choice options. It formalizes the score via $\text{LLM Score} = H(X)/H_{\max}$, $\text{Choice Score} = m/N$, and their harmonic mean, $\text{Grade Score} = \frac{2 \cdot \text{LLM Score} \cdot \text{Choice Score}}{\text{LLM Score} + \text{Choice Score}}$, and investigates prompt engineering and option sampling to mitigate order bias. Using Monte Carlo permutation trials on the Open Assistant (OASST) dataset, the study shows that random option inclusion and carefully designed prompts can enhance fairness and reliability, and it reveals an emergent behavior in which instruction-following models adapt to bias-directed prompts. The results guide the design of robust, fair LLM judging systems and point to future work on expanding prompts, evaluation domains, and the scalability of the Grade Score framework.
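As a rough illustration of how the three formulas above combine, the following Python sketch computes the scores from a list of trial outcomes. It rests on assumptions not stated in this summary: $X$ is taken to be the position of the selected option in each trial, $H_{\max}$ is taken to be $\log k$ for $k$ options, $m$ is taken to be the count of the most frequently chosen option (by content), and $N$ is the number of trials. The function names are hypothetical and are not taken from the GradeLab repository.

```python
import math
from collections import Counter


def llm_score(selected_positions, num_options):
    """Normalized entropy H(X) / H_max of the selected positions.

    Assumption: X is the position (index) the judge picked in each trial,
    and H_max = log(num_options), the entropy of a uniform distribution.
    """
    if num_options < 2:
        raise ValueError("need at least two options")
    total = len(selected_positions)
    counts = Counter(selected_positions)
    # H(X) = sum p * log(1/p); the log base cancels in the ratio.
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(num_options)


def choice_score(selected_options):
    """Mode frequency m / N over the content of the selected options.

    Assumption: m is the count of the most frequently chosen option.
    """
    mode_count = Counter(selected_options).most_common(1)[0][1]
    return mode_count / len(selected_options)


def grade_score(selected_positions, selected_options, num_options):
    """Harmonic mean of the LLM Score and the Choice Score."""
    a = llm_score(selected_positions, num_options)
    b = choice_score(selected_options)
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Illustrative data only: 20 trials over 4 options.
positions = [0, 1, 2, 3] * 5          # picks spread evenly across positions
options = ["A"] * 15 + ["B"] * 5      # option "A" chosen in 15 of 20 trials
print(grade_score(positions, options, num_options=4))
```

Under these assumptions, a judge that always picks the same position gets an LLM Score of 0, which drives the harmonic mean to 0 regardless of how stable its choices are. That is the usual reason for preferring a harmonic mean over an arithmetic one when both components must be high.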
Abstract
This study introduces the "Grade Score", a novel metric designed to evaluate the consistency and fairness of Large Language Models (LLMs) when used as multiple-choice judges, with respect to order bias and choice consistency. The Grade Score combines Entropy, which measures order bias, and Mode Frequency, which assesses choice stability, offering insight into the reliability and impartiality of LLMs. The study explores techniques such as prompt engineering and option sampling strategies to optimize the Grade Score and demonstrates their effectiveness in improving LLM performance. Results show that performance varies across LLMs and prompts, and they highlight the positive impact of including irrelevant options. The study also identifies an emergent behavior in instruction-following models: they adapt to instructions that target specific biases. The Grade Score facilitates comparisons between LLMs and encourages ongoing research into optimizing their decision-making processes, with potential implications for improving their reliability and fairness in various applications. All code is available on GitHub: https://github.com/IoDmitri/GradeLab
