Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano

TL;DR

This paper addresses the unreliability of multiple-choice (MC) benchmarks for LLM evaluation by introducing Consistency-Rebalanced Accuracy (CoRA), a metric that adjusts multiple-choice question answering (MCQA) scores according to the response consistency observed over divergent distractor sets. CoRA relies on two intermediate scores, Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), to quantify minimal consistency and the consistency gap, then rebalances the MCQA score with CI to yield a more faithful measure of knowledge. Across MedQA and general benchmarks with diverse LLMs, CoRA reveals that high MCQA performance can coincide with low consistency, and it significantly scales down the scores of inconsistent models. Ablation analyses show robustness to how the divergent question sets are constructed, and the work provides open-source tooling for replication. By explicitly accounting for response consistency, the approach makes benchmark scores more reliable and supports safer, more dependable LLM deployment.

Abstract

In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, which improves the reliability of Large Language Model (LLM) scores computed on multiple-choice (MC) benchmarks. Our metric explores the response consistency of LLMs, taking advantage of synthetically generated questions with altered answer choices. Using two intermediate scores, namely Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations on different benchmarks using diverse LLMs, and demonstrate not only that LLMs can exhibit low response consistency even when they achieve high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.

Paper Structure

This paper contains 14 sections, 10 equations, 1 figure, 7 tables.

Figures (1)

  • Figure 1: Illustration of the methods to create divergent sets of alternatives