MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences

Genta Indra Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, Derry Tanti Wijaya

TL;DR

The paper addresses the misalignment between standard generation-evaluation metrics and human preferences by introducing MetaMetrics, a meta-metric that is calibrated in a supervised manner and learns an optimized combination of existing metrics. It formalizes a framework that normalizes, weights, and fuses multiple base metrics using Bayesian Optimization or boosting, supporting both reference-based and reference-free evaluation across language and vision tasks. On abstractive summarization, machine translation, question answering, image captioning, and reward-model scoring, MetaMetrics achieves superior or competitive alignment with human judgments and remains robust under cross-lingual and cross-dataset shifts. The approach emphasizes interpretability, efficiency, and parallelizability, which makes it practical for real-world deployment and extensible to new modalities and tasks.
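The normalize, weight, and fuse steps can be illustrated with a minimal sketch. It is not the authors' implementation: min-max normalization, a weighted sum as the fusion function, Kendall correlation as the objective, and a simple random search standing in for Bayesian Optimization are all assumptions made for illustration, and every function name and data value below is hypothetical.

```python
import random
from itertools import combinations


def min_max_normalize(scores):
    """Rescale one metric's scores to [0, 1]; a constant metric maps to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def fuse(weights, normalized):
    """Weighted sum of normalized metric scores, one fused score per example."""
    n_examples = len(normalized[0])
    return [sum(w * col[k] for w, col in zip(weights, normalized))
            for k in range(n_examples)]


def fit_weights(metric_scores, human_scores, n_trials=200, seed=0):
    """Pick the candidate weight vector whose fused score best matches human ranks.

    Random search is an illustrative stand-in for Bayesian Optimization.
    """
    rng = random.Random(seed)
    normalized = [min_max_normalize(m) for m in metric_scores]
    n = len(normalized)
    # Seed the search with each single metric and with equal weights, so the
    # result is never worse than the best individual base metric.
    candidates = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    candidates.append([1.0 / n] * n)
    for _trial in range(n_trials):
        raw = [rng.random() for _i in range(n)]
        total = sum(raw) or 1.0
        candidates.append([r / total for r in raw])
    best_w, best_tau = None, float("-inf")
    for w in candidates:
        tau = kendall_tau(fuse(w, normalized), human_scores)
        if tau > best_tau:
            best_w, best_tau = w, tau
    return best_w, best_tau


# Toy data (hypothetical): two base metrics on different scales, five examples.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b = [10, 5, 30, 20, 50]
weights, tau = fit_weights([metric_a, metric_b], human)
print(weights, tau)
```

In this toy case each base metric misorders some pairs on its own, while the fused score can rank all five examples in the human order, which is the intuition behind combining metrics rather than picking a single one.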

Abstract

Understanding the quality of a performance evaluation metric is crucial for ensuring that model outputs align with human preferences. However, it remains unclear how well each metric captures the diverse aspects of these preferences, as metrics often excel in one particular area but not across all dimensions. To address this, it is essential to systematically calibrate metrics to specific aspects of human preference, catering to the unique characteristics of each aspect. We introduce MetaMetrics, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. MetaMetrics optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. MetaMetrics aligns closely with human preferences and is highly extendable and easily integrable into any application. This makes MetaMetrics a powerful tool for improving the evaluation of generation tasks, ensuring that metrics are more representative of human judgment across diverse contexts.

Paper Structure

This paper contains 125 sections, 2 equations, 11 figures, 20 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Examples of image captioning on the THumB 1.0 dataset comparing MetaMetrics against BLEU and CLIP-S scores. MetaMetrics scores of predicted captions are closer to human ratings than BLEU and CLIP-S scores in the left and right images, respectively.
  • Figure 2: Kendall correlation results of MetaMetrics-Cap. Cross-Dataset means that MetaMetrics is tuned on THumB 1.0 and tested on Flickr8k, and vice versa. MetaMetrics-Cap outperforms other metrics across different datasets and shows robustness even when trained and tested on different datasets (Cross-Dataset setting). More detailed results for all metrics and their variants can be found in Table \ref{tab:full-image_caption} in the Appendix.
  • Figure 3: GP weights (left) and XGBoost feature importance (center), with intra-metric correlation (right), for MetaMetrics-QA.
  • Figure 4: MetaMetrics framework for the reference-free ($\theta_{\textsc{MM}}^{\text{noref}}$, left) and reference-based ($\theta_{\textsc{MM}}^{\text{ref}}$, right) settings. MetaMetrics integrates multiple metrics $\theta_i$ and their scores $\hat{y}_i$, and learns a function $\Phi$ to combine them into a score $\hat{y}_{\textsc{MM}}$ that aligns best with the human judgment score.
  • Figure 5: Weights and feature importance for MetaMetrics-QA.
  • ...and 6 more figures
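
The Figure 4 caption can be read as the following formulation. This is a sketch of that reading, not the paper's own equation: the human score vector $z$, the alignment measure $\rho$ (for example, the Kendall correlation used in Figure 2), and the number of base metrics $N$ are notation assumed here.

$$
\hat{y}_{\textsc{MM}} = \Phi(\hat{y}_1, \dots, \hat{y}_N), \qquad \Phi^{*} = \arg\max_{\Phi} \; \rho\left(\hat{y}_{\textsc{MM}}, z\right)
$$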