
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin

TL;DR

ContrastScore is a contrastive evaluation metric that uses a weaker amateur model to calibrate the judgments of a stronger expert model, addressing limitations of single-model and reference-based metrics. By computing $\text{ContrastScore} = \sum_{t} w_t \log\left( \left| p_{\text{EXP}}^{t} - \gamma p_{\text{AMA}}^{t} \right| \right)$ with $\gamma \in [0,1]$, the method emphasizes tokens on which the two models disagree, producing scores that align better with human ratings. Empirical results on machine translation (MT) and summarization show higher Pearson correlations with human judgments, reduced likelihood bias, and notable efficiency gains (up to ~1.7× faster than a single larger model) when using two smaller models (e.g., $3\mathrm{B}$ and $0.5\mathrm{B}$). The work demonstrates that contrastive evaluation can achieve robust quality assessment at lower computational cost, offering a practical path toward more reliable automatic NLG evaluation.
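The formula in the TL;DR can be illustrated with a minimal sketch that operates on per-token probabilities already obtained from the two models. The function name `contrast_score`, the uniform default for $w_t$, the default value of `gamma`, and the `eps` floor that guards against $\log 0$ are all assumptions made for illustration; the summary above does not specify them, and this is not the authors' implementation.

```python
import math
from typing import Optional, Sequence


def contrast_score(
    p_exp: Sequence[float],
    p_ama: Sequence[float],
    gamma: float = 0.5,                       # assumed default; the source only states gamma in [0, 1]
    weights: Optional[Sequence[float]] = None,
    eps: float = 1e-12,                       # assumed floor to avoid log(0)
) -> float:
    """Sum over tokens of w_t * log(|p_exp_t - gamma * p_ama_t|)."""
    if len(p_exp) != len(p_ama):
        raise ValueError("expert and amateur sequences must have the same length")
    if len(p_exp) == 0:
        raise ValueError("at least one token probability is required")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")

    n = len(p_exp)
    if weights is None:
        # Assumption: uniform weights, i.e. a per-token average.
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights must have one entry per token")

    total = 0.0
    for w, pe, pa in zip(weights, p_exp, p_ama):
        diff = abs(pe - gamma * pa)           # contrast between expert and scaled amateur
        total += w * math.log(max(diff, eps))
    return total


# Hypothetical per-token probabilities of a generated text under each model.
expert = [0.9, 0.6, 0.3]
amateur = [0.8, 0.2, 0.3]
score = contrast_score(expert, amateur, gamma=0.5)
```

With `gamma = 0` the sketch reduces to a weighted log-likelihood under the expert model alone, which makes the role of the amateur term easy to see: larger `gamma` discounts tokens that the amateur also finds likely.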

Abstract

Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.

Paper Structure

This paper contains 36 sections, 8 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Efficiency and quality of ContrastScore with smaller models compared to a single larger model on the summarization task.
  • Figure 2: Impact of $\gamma$ on the correlation between evaluator scores and human scores for the ZH-EN language pair on MQM23.
  • Figure 3: Exploration of weighted ensemble parameter $\gamma$ for LLaMA-3.2 (3B, 1B) on summarization. Best quality occurs at $\gamma = 0.55$ (star), but remains below ContrastScore using the same models. Horizontal lines show individual model performance.