Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre

TL;DR

The results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators, and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.

Abstract

Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.

Paper Structure

This paper contains 37 sections, 3 equations, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Overview of the evaluation and alignment pipeline. French medical OEQA questions are answered by multiple LLM generators and annotated by a clinician for binary semantic equivalence. LLM evaluators judge the same instances using an identical prompt. A lightweight evaluator (Phi-3.5-mini) is further aligned via SFT and GRPO, and improvements are validated through paired significance testing.
  • Figure 2: Heatmap of F1 scores for each judge model across answer-generating models.
  • Figure 3: Comparison of F1 scores of the Phi models on answers generated by multiple LLMs.
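
The per-generator comparison summarized in Figures 2 and 3 can be illustrated with a minimal sketch. It assumes each instance carries the name of the generator that produced the answer, the clinician's binary equivalence label, and the judge's binary label, and it assumes F1 is computed on the "equivalent" class; this page does not state which F1 variant the paper uses. The function names and the toy records are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def f1_score(gold, pred):
    """Binary F1 on the positive ("equivalent") class; 0.0 if undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_by_generator(records):
    """Group (generator, expert_label, judge_label) triples by generator
    and score the judge against the expert within each group."""
    grouped = defaultdict(lambda: ([], []))
    for generator, expert_label, judge_label in records:
        gold, pred = grouped[generator]
        gold.append(expert_label)
        pred.append(judge_label)
    return {gen: f1_score(gold, pred) for gen, (gold, pred) in grouped.items()}


# Hypothetical toy data: one judge, two answer generators.
records = [
    ("generator_a", 1, 1),
    ("generator_a", 0, 0),
    ("generator_a", 1, 1),
    ("generator_b", 1, 0),
    ("generator_b", 0, 1),
    ("generator_b", 1, 1),
]
scores = f1_by_generator(records)
for generator, score in sorted(scores.items()):
    print(f"{generator}: F1 = {score:.2f}")
```

Running the same function once per judge gives one row of a judge-by-generator grid like the heatmap in Figure 2. A large spread of scores within a row is what the abstract describes as generator sensitivity.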