Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments

Iker De la Iglesia; Iakes Goenaga; Johanna Ramirez-Romero; Jose Maria Villa-Gonzalez; Josu Goikoetxea; Ander Barrena

TL;DR

This work introduces a novel evaluation methodology for LLM-generated medical explanatory arguments, relying on Proxy Tasks and rankings to closely align results with human evaluation criteria, overcoming the biases typically seen in LLMs used as judges.

Abstract

Evaluating LLM-generated text has become a key challenge, especially in domain-specific contexts such as medicine. This work introduces a novel evaluation methodology for LLM-generated medical explanatory arguments that relies on Proxy Tasks and rankings to align results closely with human evaluation criteria and to overcome the biases typically seen in LLMs used as judges. We demonstrate that the proposed evaluators are robust against adversarial attacks, including the assessment of non-argumentative text. Additionally, training the evaluators requires only one human-crafted argument per Proxy Task. By examining multiple LLM-generated arguments, we establish a methodology for determining whether a Proxy Task is suitable for evaluating LLM-generated medical explanatory arguments, requiring only five examples and two human experts.
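To make the ranking idea concrete, the following is a minimal sketch of how an evaluator's ranking could be compared against a human ranking and checked for robustness to control cases. It rests on several assumptions that the abstract does not confirm: that each argument source receives a single Proxy Task score, that agreement is measured with Kendall's tau, and that robustness means control cases land at the bottom of the ranking. All names (`rank_by_score`, `kendall_tau`, `controls_ranked_last`) and all numbers are hypothetical placeholders, not results from the paper.

```python
from itertools import combinations


def rank_by_score(scores):
    """Order argument sources from highest to lowest Proxy Task score."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two tie-free rankings of the same items."""
    pos_a = {name: i for i, name in enumerate(ranking_a)}
    pos_b = {name: i for i, name in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        # Positive product: both rankings order the pair the same way.
        agreement = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    total_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / total_pairs


def controls_ranked_last(ranking, control_names):
    """True if every control case sits below every primary argument."""
    tail = ranking[len(ranking) - len(control_names):]
    return set(tail) == set(control_names)


# Placeholder scores for illustration only (assumed, not from the paper).
scores = {
    "gold": 0.82,
    "llm_a": 0.78,
    "llm_b": 0.71,
    "no_argument": 0.45,
    "noise_argument": 0.40,
}
controls = {"no_argument", "noise_argument"}

evaluator_ranking = rank_by_score(scores)
primary_ranking = [name for name in evaluator_ranking if name not in controls]
human_ranking = ["gold", "llm_a", "llm_b"]  # hypothetical expert ordering

print("tau vs. human:", kendall_tau(primary_ranking, human_ranking))
print("controls last:", controls_ranked_last(evaluator_ranking, controls))
```

Comparing rankings rather than raw scores means only the relative order of argument sources matters, so evaluators with differently calibrated scores can still be compared against human preferences.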

Paper Structure (31 sections, 4 figures, 12 tables)

Introduction
Related Work
Experimental Setup
Proxy Tasks & Proxy Task LM Evaluators
Proxy Tasks Benchmarks
Medical Multiple Choice QA Benchmark
Misinformation Detection Benchmark
NLI Benchmark
Proxy Task LM Evaluators
Baseline Evaluator
Expert-Trained Evaluator
LLM-Trained Evaluator
Primary Arguments and Control Cases
LLM-Generated Synthetic Arguments
Control Cases
...and 16 more sections

Figures (4)

Figure 1: Graphical abstract illustrating the key elements of our approach. Synthetic arguments are first generated by prompting multiple LLMs, which are then ranked alongside gold-standard arguments by both our trained LM evaluator and a human expert. Our results show the LM evaluator aligns with human preferences.
Figure 2: A graphical abstract illustrating the system’s main components and behavior. The proposed LM evaluator prioritizes ranking primary arguments first and placing control cases last.
Figure 3: Ranking of the gold-standard argument alongside those generated by automatic systems. Each row corresponds to a distinct evaluator: the first three rows correspond to our proposed Proxy Task evaluators based on discriminative classification models, while the last row reflects the human criteria, obtained by having experts directly rank the arguments.
Figure 4: Ranking of the gold-standard argument, LLM-generated arguments, and Control Cases by the Proxy Task evaluators for each Proxy Task. Each row represents a distinct evaluator, and the columns include Primary Arguments (gold standard and LLM-generated) as well as Control Cases (No Argument, Label-Only Input, Noise Argument, and IR Passages). The figure highlights the evaluators' ability to differentiate between proper and improper arguments.
