xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation

Qingchen Yu; Zifan Zheng; Shichao Song; Zhiyu Li; Feiyu Xiong; Bo Tang; Ding Chen

TL;DR

The paper examines the reliability of LLM evaluation pipelines and shows that both RegEx-based answer extraction and judge-model approaches can yield inconsistent judgments. It introduces xFinder, a specialized evaluator trained on the Key Answer Finder (KAF) dataset to improve key-answer extraction and matching. In the reported experiments, the smallest xFinder model reaches an average extraction accuracy of 93.42%, and final judgment accuracy reaches 97.61%, at substantially lower cost than GPT-4-based evaluation. Real-world tests show that xFinder produces more stable model rankings than existing frameworks, which supports its use for trustworthy LLM evaluation at scale.

Abstract

The continuous advancement of large language models (LLMs) has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. Particularly, the emergence of cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. As evaluation frameworks commonly use Regular Expression (RegEx) for answer extraction, models may adjust their responses to fit formats easily handled by RegEx. Nevertheless, the key answer extraction module based on RegEx frequently suffers from extraction errors. Furthermore, recent studies proposing fine-tuned LLMs as judge models for automated evaluation face challenges in terms of generalization ability and fairness. This paper comprehensively analyzes the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. Our findings suggest that improving the key answer extraction module can lead to higher judgment accuracy and improved evaluation efficiency compared to the judge models. To address these issues, we propose xFinder, a novel evaluator for answer extraction and matching in LLM evaluation. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
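
To make the RegEx failure mode described in the abstract concrete, the following is a minimal sketch of a pattern-based key answer extractor followed by exact-match judgment. The pattern, the function names `extract_key_answer` and `judge`, and the sample responses are hypothetical illustrations, not taken from the paper or from any specific evaluation framework.

```python
import re
from typing import Optional

# Hypothetical, simplified pattern in the spirit of RegEx-based extractors.
# It is an assumption for illustration, not any framework's actual rule.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def extract_key_answer(response: str) -> Optional[str]:
    """Return the first option letter matched by the pattern, or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


def judge(response: str, gold: str) -> bool:
    """Exact-match judgment on the extracted key answer."""
    return extract_key_answer(response) == gold


# Three made-up responses that all intend option B.
responses = [
    "The answer is (B).",                                     # fits the pattern
    "I would go with option B here.",                         # nothing extracted
    "Some say the answer is A, but the correct choice is B.", # wrong letter extracted
]

for text in responses:
    print(repr(extract_key_answer(text)), judge(text, gold="B"))
```

Only the first response is judged correct, even though all three select B: the second yields no extraction at all and the third yields the wrong option. Errors of this kind are what a learned extractor such as xFinder is intended to reduce.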

Paper Structure (51 sections, 4 equations, 32 figures, 43 tables)

Introduction
Data Item Transformation
Question Prompting and LLM Answering
Key Answer Extraction and Matching
Related Work
Problem Definition
Direct
Prompt wrapped
Converted question wrapped
Methodology
LLM Response Generation
Auto Labelling and Human Recheck
Training xFinder
Experiments
Extraction Accuracy: xFinder vs. RegEx
...and 36 more sections

Figures (32)

Figure 1: Typical LLM Evaluation Pipeline.
Figure 2: Cases where LM Eval Harness and OpenCompass fail in extracting key answers. A/T/C/M stands for tasks with alphabet / short text / categorical label / math options, respectively.
Figure 3: Schematic of the research framework. The first three stages correspond to the sections on LLM Response Generation, Auto Labelling and Human Recheck, and Training xFinder, while the final stage illustrates the replacement of RegEx with xFinder in the evaluation pipeline. The real-world evaluation experiments demonstrate the efficacy of our approach within this pipeline. Note: The percentages 80.3%, 77.0%, and 2.2% in the center of the figure around "unreliable evaluation" indicate results from Llama3-8B-Instruct on the GSM8K benchmark using RegEx evaluation via the LM Eval Harness, OpenCompass, and UltraEval frameworks, respectively, while our method achieves a reliable result of 80.2%.
Figure 4: Bump charts: Changes in LLM rank over different evaluation frameworks.
Figure 5: Label Studio Interface.
...and 27 more figures
