*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive

TL;DR

This work introduces *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluates their alignment with human judgement. Personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.

Abstract

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
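
As a rough illustration of how confidence over "Yes/No" answers can be read off without generating text, the sketch below compares the log-probabilities a model assigns to the two candidate answers as the next token. The function names, the log-odds form and the zero threshold are illustrative assumptions, not the definition of ParaPLUIE or *-PLUIE given in the paper.

```python
import math


def yes_no_log_odds(logprob_yes: float, logprob_no: float) -> float:
    """Hypothetical score: log P("Yes") - log P("No") for the next token.

    Positive values mean the model favours "Yes"; no text is generated,
    only the two next-token log-probabilities are needed.
    """
    return logprob_yes - logprob_no


def yes_probability(log_odds: float) -> float:
    """Probability of "Yes" renormalised over the two answers (stable sigmoid)."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


def decide(log_odds: float, threshold: float = 0.0) -> bool:
    """Binary decision; the threshold is an assumption and would be tuned per task."""
    return log_odds > threshold


if __name__ == "__main__":
    # Made-up log-probabilities, for illustration only.
    score = yes_no_log_odds(-0.1, -2.4)
    print(score, yes_probability(score), decide(score))
```

Because the score is a single real number, it can be thresholded directly, which avoids parsing free-form judge output.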

Paper Structure (20 sections, 6 equations, 25 figures, 10 tables)


Figures (25)

  • Figure 1: *-PLUIE workflow.
  • Figure 2: Example of data in the ParaReval dataset
  • Figure 3: Score distribution of Modern BertScore (a) and Fr-PLUIE (b). The blue, orange, red and green curves denote respectively the accuracy, recall, precision and F1-scores according to the decision threshold. Emphasis is placed on the maximum F1-score obtained by the metric.
  • Figure 4: Accuracy, recall, precision and F1-score over different threshold values for Modern BertScore (a) and Net-PLUIE (b). Emphasis is placed on the maximum F1-score obtained by the metric.
  • Figure 5: Llama translations
  • ...and 20 more figures