Language Models can Evaluate Themselves via Probability Discrepancy
Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou
TL;DR
ProbDiff is a self-evaluation method for LLMs based on the probability discrepancy between a model's initial answer and its revision, so models can be compared without external judges or proprietary evaluators. The method rests on the observation that more capable LLMs exhibit flatter log-probability distributions, so their probabilities vary less across revisions, while larger discrepancies correlate with weaker performance. Across translation, summarization, Xiaohongshu blog writing, and multiple benchmarks (MT-Bench, AlpacaEval, AlignBench), ProbDiff achieves results comparable to GPT-4-based evaluations while avoiding data leakage and API costs. Its limitations are that it does not quantify the magnitude of improvement and that it is sensitive to output length; both suggest avenues for future refinement and for broader adoption as a robust supplementary evaluation tool.
Abstract
In this paper, we first demonstrate that more capable Large Language Models (LLMs), when responding to queries, display a more even probability distribution over their answers than less capable ones. Building on this insight, we propose ProbDiff, a new self-evaluation method for assessing the efficacy of various LLMs. This approach removes the need for an additional evaluation model or for external, proprietary models such as GPT-4 as judges. Instead, it uses the LLM being tested to compute the probability discrepancy between its initial response and the revised versions of that response. When two LLMs answer the same query, the one with the higher discrepancy is considered relatively weaker. Our findings show that ProbDiff achieves results on par with GPT-4-based evaluations across LLMs of varying sizes, in scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, as well as LLM evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval.
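The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes the discrepancy is the length-normalized log-probability of the revised response minus that of the initial response, both scored by the model under test. The function names (`mean_logprob`, `prob_discrepancy`, `weaker_model`) and the token log-probabilities are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def mean_logprob(token_logprobs):
    """Average per-token log-probability of one response.

    Assumption: dividing by length is our own choice to reduce the
    effect of output length; the abstract does not specify this.
    """
    return fmean(token_logprobs)


def prob_discrepancy(initial_logprobs, revised_logprobs):
    """Assumed discrepancy: revised score minus initial score,
    both computed by the same model that wrote the responses."""
    return mean_logprob(revised_logprobs) - mean_logprob(initial_logprobs)


def weaker_model(disc_a, disc_b):
    """Per the abstract, the model with the higher discrepancy on a
    query is judged relatively weaker. Returns "A", "B", or "tie"."""
    if disc_a > disc_b:
        return "A"
    if disc_b > disc_a:
        return "B"
    return "tie"


if __name__ == "__main__":
    # Made-up token log-probabilities for a single query.
    disc_a = prob_discrepancy([-0.5, -0.4, -0.6], [-0.45, -0.35, -0.55])
    disc_b = prob_discrepancy([-1.2, -0.8, -1.0], [-0.6, -0.4, -0.5])
    print("weaker on this query:", weaker_model(disc_a, disc_b))
```

In practice the per-token log-probabilities would come from scoring each response with the model being evaluated, and the per-query verdicts would be aggregated over a whole test set; both steps are outside this sketch.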
