Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Fangyun Wei, Xi Chen, Lin Luo

TL;DR

The paper argues that MCQA-based evaluation of large language models inadequately captures the semantic comprehension required in real tasks. It introduces RWQ-Elo, an Elo-based, two-player contest framework that uses the Real-World Questions benchmark and a GPT-4 judge to provide a more realistic, scalable evaluation across 24 LLMs. The authors demonstrate the stability and practicality of RWQ-Elo, including fast registration of new models, and compare it to existing leaderboards such as AlpacaEval and MT-Bench. Overall, RWQ-Elo offers a more discriminative, open-ended evaluation paradigm that better mirrors real-world usage and has the potential to reshape LLM ranking standards.

Abstract

Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA, for instance, the inconsistency between MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce the RWQ-Elo rating system, which engages 24 LLMs, such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called "Real-world questions" (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards like AlpacaEval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
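
The two-player format described in the abstract can be illustrated with a textbook Elo update. This is a minimal sketch, not the authors' implementation: the K-factor of 32, the starting rating of 1000, and the function names `expected_score` and `update_elo` are assumptions chosen for illustration, and the judge's verdict is represented simply as a score of 1.0 (win), 0.5 (tie), or 0.0 (loss).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after one contest.

    score_a is the judge's verdict from A's point of view:
    1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k is an assumed K-factor; the source does not state one here.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two hypothetical models start from the same assumed rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)
```

Because the update depends on the gap between the actual and expected score, a win over a much stronger opponent moves the ratings more than a win over an equal or weaker one.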

Paper Structure (10 sections, 2 equations, 9 figures, 22 tables)

This paper contains 10 sections, 2 equations, 9 figures, 22 tables.

Figures (9)

  • Figure 1: Statistics for our Real-World Question (RWQ) benchmark. Examples for each source are available in Table \ref{tab:RWQ-examples} of the appendix.
  • Figure 2: (a) Comparison of various leaderboards, including our RWQ-Elo (the Elo rating for each LLM is reported in brackets), Chatbot Arena [zheng2023judging], MT-Bench [zheng2023judging], and AlpacaEval (v1.0 and v2.0) [alpaca_eval]. (b) Statistics from running our RWQ-Elo system 100 times. We show the Elo ratings for 13 selected LLMs. The complete statistics can be found in Figure \ref{fig:elo_results_all}.
  • Figure 3: Absolute differences between the win-rate map generated by our Elo system and the pre-calculated win-rate map. We include 13 LLMs. The two complete win-rate maps alongside their difference map can be found in Figure \ref{fig:pre-calculated-map}-\ref{fig:elo-map} and Figure \ref{fig:complete-win-rate-map} of the appendix.
  • Figure 4: Visualization of the win-rate trends between two LLMs ((a) Falcon-Instruct-40B vs. MPT-Chat-30B; (b) Falcon-Instruct-7B vs. Gemini-Pro). The horizontal lines represent the pre-calculated win rates. As the contest rounds progress, the win rate estimated by our Elo rating system gradually converges to the pre-calculated win rate.
  • Figure 5: We compare our RWQ-Elo rating system with various AlpacaEval variants, where GPT-4-Turbo, GPT-3.5-Turbo, and LLaMA-1-13B serve as the competitors. RWQ-Elo (All) and RWQ-Elo (200) denote that the system is run using all instances and a random selection of 200 instances, respectively, from our RWQ benchmark. We use the same instances from RWQ-Elo (200) for AlpacaEval. While AlpacaEval uses win rate as its metric, our RWQ-Elo system uses the Elo score. In AlpacaEval, when LLMs compete against an LLM that is significantly superior or inferior, the performance differences among them become hard to distinguish. In contrast, our system does not exhibit this issue.
  • ...and 4 more figures
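
The comparison in the Figure 3 caption, an Elo-derived win-rate map set against a pre-calculated one, can be sketched as follows. This is an illustration under assumptions rather than the paper's procedure: it assumes the Elo-derived win rate is the standard logistic Elo expectation, and the model names, ratings, pre-calculated values, and function names (`elo_win_rate`, `win_rate_map`, `abs_difference_map`) are all hypothetical.

```python
def elo_win_rate(rating_a: float, rating_b: float) -> float:
    """Win rate of A over B implied by the standard Elo expectation."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def win_rate_map(ratings: dict[str, float]) -> dict[tuple[str, str], float]:
    """Build a pairwise win-rate map from a dict of Elo ratings."""
    return {
        (a, b): elo_win_rate(ratings[a], ratings[b])
        for a in ratings
        for b in ratings
        if a != b
    }


def abs_difference_map(elo_map, precalculated_map):
    """Absolute per-pair difference between two win-rate maps."""
    return {
        pair: abs(elo_map[pair] - precalculated_map[pair])
        for pair in elo_map
    }


# Hypothetical ratings and hypothetical directly measured win rates.
ratings = {"model_x": 1200.0, "model_y": 1000.0}
precalculated = {("model_x", "model_y"): 0.75, ("model_y", "model_x"): 0.25}

difference = abs_difference_map(win_rate_map(ratings), precalculated)
for pair, value in difference.items():
    print(pair, round(value, 4))
```

Small values in the difference map indicate that the ratings reproduce the directly measured pairwise outcomes, which is the sense in which the captions describe the Elo win rate converging to the pre-calculated one.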