Language Models can Evaluate Themselves via Probability Discrepancy

Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou

TL;DR

ProbDiff is a self-evaluation method for LLMs based on the probability discrepancy between a model's initial answer and its revision, which allows models to be compared without external judges or proprietary evaluators. The method rests on the observation that more capable LLMs exhibit flatter log-probability distributions over their answers, so their probabilities vary less across revisions; a larger discrepancy therefore correlates with weaker performance. Across translation, summarization, Xiaohongshu blog writing, and several LLM benchmarks (MT-Bench, AlpacaEval, AlignBench), ProbDiff achieves results comparable to GPT-4-based evaluations while avoiding data leakage and API costs. Its limitations are that it does not quantify the magnitude of improvement and that it is sensitive to output length; both are avenues for future refinement toward broader adoption as a robust supplementary evaluation tool.

Abstract

In this paper, we begin by demonstrating that more capable Large Language Models (LLMs), when responding to queries, display a more even probability distribution in their answers than their less capable counterparts. Building on this insight, we propose a new self-evaluation method, ProbDiff, for assessing the efficacy of various LLMs. This approach removes the need for an additional evaluation model or for external, proprietary models such as GPT-4 as judges. Instead, it uses the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. When two LLMs are compared on a given query, the one with the higher discrepancy is considered relatively weaker. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4, spanning a range of scenarios that include natural language generation (NLG) tasks such as translation, summarization, and our proposed Xiaohongshu blog writing task, and benchmarks for LLM evaluation like AlignBench, MT-Bench, and AlpacaEval, across LLMs of varying magnitudes.
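
The comparison rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`avg_log_prob`, `prob_discrepancy`, `compare_models`, `win_rate`) are hypothetical, the discrepancy is assumed to be the revised response's log-probability minus the initial response's, and the per-token averaging is an assumed normalization chosen to reduce sensitivity to output length. The per-token log-probabilities are assumed to come from the model being evaluated, scoring its own initial and revised responses.

```python
from statistics import mean


def avg_log_prob(token_log_probs):
    """Average per-token log-probability of one response.

    Averaging (rather than summing) is an assumption of this sketch,
    used so that long and short responses are roughly comparable.
    """
    if not token_log_probs:
        raise ValueError("response has no tokens")
    return mean(token_log_probs)


def prob_discrepancy(initial_log_probs, revised_log_probs):
    """Discrepancy between a revision and the initial response.

    Assumed sign convention: revised minus initial, both scored by the
    same model that produced them.
    """
    return avg_log_prob(revised_log_probs) - avg_log_prob(initial_log_probs)


def compare_models(disc_a, disc_b):
    """Pick the model judged stronger on one query.

    Follows the rule stated in the abstract: the model with the higher
    discrepancy is considered relatively weaker.
    """
    if disc_a < disc_b:
        return "A"
    if disc_b < disc_a:
        return "B"
    return "tie"


def win_rate(discs_a, discs_b):
    """Fraction of queries on which model A is judged stronger.

    Counting a tie as half a win is an assumption of this sketch.
    """
    if len(discs_a) != len(discs_b) or not discs_a:
        raise ValueError("need one discrepancy per query for each model")
    score = 0.0
    for a, b in zip(discs_a, discs_b):
        verdict = compare_models(a, b)
        if verdict == "A":
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(discs_a)
```

For example, `win_rate([0.25, 2.0, 0.5], [1.0, 1.0, 0.5])` counts one win, one loss, and one tie for model A. A win rate computed this way is the kind of quantity that could then be set against a GPT-4-judged win rate on the same queries.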

Paper Structure (17 sections, 4 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: An overview of ProbDiff, wherein an LLM iteratively revises its responses, calculating the resulting probability discrepancies as a self-evaluation metric. Larger discrepancies imply decreased confidence in the generated outcomes, with greater variances indicating poorer performance.
  • Figure 2: Log probability curves of the responses for Yi-34B-Chat and WizardLM-70B on AlpacaEval-2.0.
  • Figure 3: We recognize and leverage the observation that superior LLMs typically exhibit smaller probability variances, along with the conclusion that the model-generated samples tend to reside in regions of negative curvature within the probability function. These findings serve as crucial distinctions for ProbDiff in discerning between models of varying capabilities.
  • Figure 4: Evaluation of Qwen-14B-Chat and Qwen-14B-Chat_ft through GPT-4 on the AlignBench (Align) and Xiaohongshu Blog Writing (Blog) tasks. "Align_gpt" and "Blog_gpt" represent the win rates judged by GPT-4; "Align_prob" and "Blog_prob" represent the confidence evaluated by ProbDiff. Orange histograms indicate the fine-tuned Qwen-14B-Chat and blue histograms indicate Qwen-14B-Chat.
  • Figure 5: Xiaohongshu blog writing data.