GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking

Darshan Deshpande, Selvan Sunitha Ravi, Sky CH-Wang, Bartosz Mielczarek, Anand Kannappan, Rebecca Qian

TL;DR

GLIDER tackles the challenge of robust, privacy-preserving evaluation of LLM outputs by training a small, 3.8B-parameter evaluator to score arbitrary inputs against user-defined criteria using fine-grained pointwise and pairwise rankings. It relies on a four-stage synthetic data pipeline spanning 183 metrics and 685 domains, and it produces explainable reasoning chains and highlight spans to improve transparency and performance. Empirical results show that GLIDER outperforms open-source judges and remains competitive with GPT-4o-family models on multiple benchmarks, while demonstrating strong multilingual generalization and benefiting from highlight spans in both pointwise and multi-criteria settings. The work also emphasizes reproducibility and openness, providing open-source code and data to advance research in scalable, interpretable model evaluation.

Abstract

The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs. While LLM judges have shown promise on constrained evaluation tasks, closed-source LLMs display critical shortcomings when deployed in real-world applications due to challenges of fine-grained metrics and explainability, while task-specific evaluation models lack cross-domain generalization. We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models, achieving comparable performance to LLMs 17x its size. GLIDER supports fine-grained scoring, multilingual reasoning, and span highlighting, and was trained on 685 domains and 183 criteria. Extensive qualitative analysis shows that GLIDER scores are highly correlated with human judgments, with 91.3% human agreement. We have open-sourced GLIDER to facilitate future research.
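
The abstract reports judge quality as Pearson's correlation and as an agreement rate with human judgments. As a rough illustration of what those two numbers measure, the sketch below computes both for a pair of score lists. The function names and the sample scores are hypothetical and are not taken from the GLIDER code or data, and the paper's own agreement protocol may differ from the simple exact-match rate used here.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human assign the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


if __name__ == "__main__":
    # Hypothetical 1-5 rubric scores for five responses (illustrative only).
    human_scores = [1, 2, 3, 4, 5]
    judge_scores = [1, 2, 3, 5, 5]
    print("Pearson r:", pearson(judge_scores, human_scores))
    print("Agreement:", agreement_rate(judge_scores, human_scores))
```

Correlation rewards a judge whose scores rise and fall with the human scores even when the exact values differ, whereas the agreement rate only counts exact matches, so the two metrics can diverge on the same data.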

Paper Structure

This paper contains 32 sections, 1 equation, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: GLIDER is capable of outputting high quality reasoning chains, scores and explainable highlight spans
  • Figure 2: Distributions for rubric scales and scores across the dataset
  • Figure 3: Generating script for pointwise data. The tags dictionary contains 15 random tags each for model input, output, context and gold answer
  • Figure 4: Prompt for generation of pairwise preference data points for training
  • Figure 5: Data verification prompt
  • ...and 1 more figures