Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

Zijun Liu, Boqun Kou, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu

TL;DR

Meta Ranking (MR) presents a cross-query, reference-based framework that lets weak LLMs judge the reliability of a target response by comparing it to a small set of reference query–response pairs. By aggregating pairwise comparison signals with a simple scoring rule, MR achieves robust error detection across backbones and languages and does not rely on full model fine-tuning. The authors demonstrate MR’s value in two practical applications: model cascading, where unreliable open-source outputs are routed to stronger closed-source models, and instruction tuning, where MR-guided data filtering improves data efficiency and downstream performance. Overall, MR offers a data-efficient, scalable approach to improving both inference-time reliability and training-time data quality for LLM systems.
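The "simple scoring rule" mentioned above can be illustrated with a minimal sketch. The voting rule below is an assumption chosen for illustration, not necessarily the exact rule used by the authors: a reference labeled correct (+1) votes for the target when the target is judged at least as reliable, and a reference labeled incorrect (-1) votes against it when the target is judged no more reliable. The names `meta_rank_score`, `is_reliable`, and the `compare` callback (which stands in for a weak LLM's pairwise judgement) are hypothetical.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (query, response)

# Hypothetical comparator signature: returns +1 if the target pair is judged
# more reliable than the reference pair, -1 if less reliable, 0 if tied.
# In practice this would wrap a call to a weak LLM; here it is a parameter.
Comparator = Callable[[Pair, Pair], int]


def meta_rank_score(
    target: Pair,
    references: List[Tuple[Pair, int]],
    compare: Comparator,
) -> int:
    """Aggregate pairwise comparisons into one score (assumed voting rule).

    Each reference carries a label: +1 (correct) or -1 (incorrect).
    """
    score = 0
    for ref_pair, ref_label in references:
        c = compare(target, ref_pair)
        if ref_label > 0 and c >= 0:
            # Target is at least as reliable as a correct reference.
            score += 1
        elif ref_label < 0 and c <= 0:
            # Target is no more reliable than an incorrect reference.
            score -= 1
        # Otherwise the comparison is treated as uninformative.
    return score


def is_reliable(score: int, threshold: int = 0) -> bool:
    """Flag the target as reliable when the score exceeds the threshold."""
    return score > threshold


# Toy stand-in for the LLM judge: compares hidden quality values.
TOY_QUALITY = {"r1": 0.8, "r2": 0.2, "r3": 0.9, "good": 0.9, "bad": 0.1}


def toy_compare(target: Pair, reference: Pair) -> int:
    diff = TOY_QUALITY[target[1]] - TOY_QUALITY[reference[1]]
    return (diff > 0) - (diff < 0)


TOY_REFERENCES = [(("q1", "r1"), +1), (("q2", "r2"), -1), (("q3", "r3"), +1)]
```

With the toy judge, `meta_rank_score(("qt", "good"), TOY_REFERENCES, toy_compare)` accumulates two positive votes, while the `"bad"` response receives one negative vote and is flagged as unreliable.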

Abstract

Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel at evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called $\textit{Meta Ranking}$ (MR). Unlike previous few-shot methods that rely solely on the in-context learning capabilities of LLMs, MR assesses reliability by pairwise ranking of the target query-response pair against multiple reference query-response pairs. We found that MR is highly effective in error detection for LLM responses, where weak LLMs, such as Phi-2, could surpass strong baselines like GPT-3.5-turbo, requiring only five reference samples and significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs' performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo at lower cost. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass Alpaca-13B with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness.
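The model cascading application described in the abstract can be sketched as a routing step: answer with the cheaper model first and escalate only when the response is judged unreliable. This is a minimal sketch under stated assumptions; `cascade`, `weak_model`, `strong_model`, `reliability_score`, and the threshold of 0 are hypothetical names and values, and the reliability scorer is a placeholder for an MR-style judgement rather than the authors' implementation.

```python
from typing import Callable, Tuple

# Hypothetical type aliases: a model maps a query to a response, and a
# scorer maps (query, response) to a reliability score.
Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def cascade(
    query: str,
    weak_model: Model,
    strong_model: Model,
    reliability_score: Scorer,
    threshold: float = 0.0,
) -> Tuple[str, str]:
    """Return (response, source), where source is "weak" or "strong".

    The weak model answers first. Its response is kept only if the
    reliability score exceeds the threshold; otherwise the query is
    routed to the strong (typically more expensive) model.
    """
    draft = weak_model(query)
    if reliability_score(query, draft) > threshold:
        return draft, "weak"
    return strong_model(query), "strong"
```

The threshold controls the cost and quality trade-off: a higher value sends more queries to the strong model, a lower value keeps more of the weak model's answers.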

Paper Structure (57 sections, 14 equations, 10 figures, 13 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of our proposed Meta Ranking (MR) method. (a) Left: The table summarizes MR and previous judgement methods with different backbone models. (b) Right: The sub-figure illustrates the different methods. "$\widehat{\text{S}}_\text{t}$" denotes the estimated score for the target query-response pair. "Query$_i$" (Q$_i$), "Response$_i$" (R$_i$), and "Score$_i$" (S$_i$) ($i=1,2$) denote the reference queries, their responses, and their scores (e.g., +1 for correct and -1 for incorrect responses). MR takes two query-response pairs at a time for cross-query comparison of reliability with a language model, then aggregates the comparison results into the estimated score of the target query-response pair.
  • Figure 1: The micro precision scores on error detection experiments on the MMLU and CMMLU datasets with responses generated by different LLMs. The bold font denotes best results. LLMs in the second row of the header are sources of responses.
  • Figure 2: Example illustrations of MR process. The correctness of the target response ($\text{R}_\text{t}$) is considered according to comparisons with reference query-response pairs.
  • Figure 3: Results on instruction tuning experiments, where MR is implemented with Phi-2. The bold font denotes best results. "Full" denotes the whole dataset.
  • Figure 4: The change of precision scores with the number of reference pairs on the MMLU dataset with target responses from LLaMA-2.
  • ...and 5 more figures