*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Quentin Lemesle; Léane Jourdan; Daisy Munson; Pierre Alain; Jonathan Chevelu; Arnaud Delhay; Damien Lolive

TL;DR

This paper introduces *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluates their alignment with human judgement. Personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.

Abstract

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
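
The abstract does not give the formula behind the "Yes/No" confidence; the paper lists a dedicated "ParaPLUIE definition" section for it. As an illustration only, the sketch below assumes the score is the log-odds of the answer "Yes" over "No", computed from next-token log-probabilities that a model assigns after the judging prompt. The function names and the input dictionary are hypothetical, and no model call is made, which mirrors the idea of scoring without generating text.

```python
import math
from typing import Mapping


def yes_no_confidence(logprobs: Mapping[str, float],
                      yes: str = "Yes", no: str = "No") -> float:
    """Assumed score: log p(yes) - log p(no) for the next token.

    `logprobs` is a hypothetical mapping from candidate answer tokens
    to their log-probabilities under the judging prompt.
    Positive values favour "Yes", negative values favour "No".
    """
    return logprobs[yes] - logprobs[no]


def to_probability(score: float) -> float:
    """Map the log-odds score to p(yes) restricted to the two answers."""
    return 1.0 / (1.0 + math.exp(-score))


# Toy input: the model puts 0.8 on "Yes" and 0.2 on "No".
example = {"Yes": math.log(0.8), "No": math.log(0.2)}
score = yes_no_confidence(example)
prob_yes = to_probability(score)
```

Under this assumption a single forward pass yields a signed, continuous score, so no generated answer has to be parsed afterwards.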

Paper Structure (20 sections, 6 equations, 25 figures, 10 tables)

Introduction
Experimental Protocol
Semantic Tasks
Baseline Metrics
*-PLUIE Metrics
Results
Classification
Preference
Computational Efficiency
Conclusion
ParaPLUIE definition
French Paraphrase Dataset
Example of data from ParaReval dataset
LLM-judge prompts
*-PLUIE new task specific prompts
...and 5 more sections

Figures (25)

Figure 1: *-PLUIE workflow.
Figure 2: Example of data in the ParaReval dataset
Figure 3: Score distribution of Modern BertScore (a) and Fr-PLUIE (b). The blue, orange, red and green curves denote respectively the accuracy, recall, precision and F1-scores according to the decision threshold. Emphasis is placed on the maximum F1-score obtained by the metric.
Figure 4: Accuracy, recall, precision and F1-score distributions over different threshold values for Modern BertScore (a) and Net-PLUIE (b). Emphasis is placed on the maximum F1-score obtained by the metric.
Figure 5: Llama translations
...and 20 more figures
