YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

Jennifer D'Souza, Hamed Babaei Giglou, Quentin Münch

TL;DR

YESciEval tackles the robustness and transparency of LLM-based evaluation in science Q&A by pairing a fine-grained, nine-rubric assessment scheme with an open-source LLM-as-a-judge trained through supervised fine-tuning and reinforcement learning. It applies rubric-anchored adversarial variants of the open ORKGSynthesis and BioASQ datasets to stress-test evaluation quality without depending on proprietary models or human feedback. The two-stage alignment (SFT, then RL with Contrastive Preference Optimization) improves the evaluator’s ability to distinguish high-quality from low-quality syntheses, including those with rubric-specific adversarial perturbations. The framework demonstrates cross-model generalizability, enabling scalable, zero-cost evaluation of science Q&A that supports AI alignment and trustworthy scientific inquiry.

Abstract

Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.

Paper Structure

This paper contains 34 sections, 17 figures, 16 tables.

Figures (17)

  • Figure 1: Examples from two domains in the YESciEval science Q&A dataset. Orange boxes show LLM input: a research question and titles of top-ranked papers (abstracts omitted for brevity). Green boxes show answer snippets from two LLMs. Light/dark gray boxes represent subtle/extreme adversarial variants targeting the conciseness and correctness rubrics. Yellow highlights indicate perturbations. YESciEval uses a nine-rubric LLM-as-a-judge scheme and tests robustness via rubric-specific adversarial edits (see \ref{sec:ext-scienceqa} for details).
  • Figure 2: YESciEval LLM-as-a-Judge Alignment: Supervised fine-tuning of $LLM_{eval}$, followed by reinforcement learning via Contrastive Preference Optimization to align open-source LLMs with desired rubric-level evaluations.
  • Figure 3: Heatmaps depicting agreement for synthesis evaluations on benign datasets. The x-axis represents the $LLM_{gen}$, while the y-axis denotes the $LLM_{eval}$.
  • Figure 4: Evaluation of synthesis across different models and fine-tuning strategies on the BioASQ and ORKGSynthesis datasets. The nine rubrics are Coherence (Cohr), Cohesion (Cohs), Completeness (Comp), Conciseness (Conc), Correctness (Corr), Informativeness (Info), Integration (Integ), Readability (Read), and Relevancy (Relv).
  • Figure 5: Number of Questions per Research Field on the ORKGSyn Dataset. The y-axis represents the "Research Fields".
  • ...and 12 more figures