PHUDGE: Phi-3 as Scalable Judge

Mahesh Deshwal; Apoorva Chawla

TL;DR

PHUDGE addresses scalable evaluation of LLM responses by fine-tuning Phi-3 with LoRA to function as a fast, high-accuracy judge. It reframes scoring as Classification or Regression rather than purely causal generation and introduces a generalized $\text{EMD}$ loss with a Minkowski-style smoothing parameter $\alpha$, yielding stable training and improved grading. Across four Feedback Evaluation tasks, the approach achieves state-of-the-art performance with competitive latency and strong correlation to GPT-4 and human judgments, outperforming larger baselines with less training data. The work also demonstrates effective data augmentation and proposes RAG-ready data strategies, offering practical insights for building robust, scalable judge models for real-world evaluation.

Abstract

In this paper and technical report, we present PHUDGE, a fine-tuned Phi-3 model that achieves SOTA results on four tasks (Feedback Test, Feedback OOD, MT Human, and Preference Test) while surpassing every existing model in latency and throughput. It shows very strong correlation not only with GPT-4 but also with human annotators, on unseen data and in both absolute and relative grading tasks. We not only address the use of small LMs for cost-effective, production-grade systems, but also show that causal modelling is slow by nature and can sometimes hinder a model's learning capabilities; it should be replaced by simpler tasks whenever possible to make the overall system faster and better. We show that by following systematic ML experimentation, thoughtful data augmentation, and repurposing the problem itself, we can beat models 10x larger even with less training data. To the best of our knowledge, we are the first to experiment with and showcase a generalised version of Earth Mover's Distance (also known as Wasserstein distance) that uses Minkowski distance with a penalty to control loss smoothing. It can be used as a loss function in place of cross-entropy to obtain stable training and better results on grading tasks.
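
To make the loss idea concrete, the following is a minimal sketch of one common way to compute an Earth Mover's Distance over ordered grades: compare the cumulative distributions of the prediction and the target, and aggregate the gaps with a Minkowski-style exponent. The exact formula used for PHUDGE is not given in this section, so the function name `minkowski_emd_loss`, the way `alpha` enters as the exponent, and the 5-grade example are all assumptions made for illustration.

```python
from itertools import accumulate


def minkowski_emd_loss(pred_probs, target_probs, alpha=2.0):
    """Minkowski-style distance between the CDFs of two distributions
    over ordered grades (hypothetical helper, not the authors' code).

    alpha=1 gives the classic 1-D Earth Mover's Distance; larger alpha
    changes how strongly large CDF gaps dominate the loss (assumption).
    """
    if len(pred_probs) != len(target_probs):
        raise ValueError("distributions must have the same number of grades")
    if alpha < 1:
        raise ValueError("alpha must be >= 1 for a valid Minkowski distance")

    # Cumulative sums encode the ordering of the grades.
    cdf_pred = list(accumulate(pred_probs))
    cdf_true = list(accumulate(target_probs))

    total = sum(abs(p - t) ** alpha for p, t in zip(cdf_pred, cdf_true))
    return total ** (1.0 / alpha)


# Illustrative 5-grade rubric; the true grade is 4 (index 3).
target = [0.0, 0.0, 0.0, 1.0, 0.0]
near_miss = [0.0, 0.0, 1.0, 0.0, 0.0]  # predicts grade 3
far_miss = [1.0, 0.0, 0.0, 0.0, 0.0]   # predicts grade 1

near_loss = minkowski_emd_loss(near_miss, target, alpha=1.0)
far_loss = minkowski_emd_loss(far_miss, target, alpha=1.0)
```

Unlike cross-entropy, which looks only at the probability assigned to the correct grade, this kind of loss penalises a prediction more the further it lands from the true grade, which is the property that makes it attractive for grading tasks.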

Paper Structure (16 sections, 3 equations, 4 figures, 4 tables)

Introduction
Related Work
Dataset
Train Data
Benchmark Test Data & Metrics
Methodology and Experiments
Base Model Selection & Test Data Leakage Test
Re-thinking Problem Definition
Classification
Regression
Ablation Study
Effect of Batch Size, LoRA and Sequence Length
Augmentations
Evaluation and Results
Limitations
...and 1 more sections

Figures (4)

Figure 1: NLG Evaluation Methods
Figure 2: Training With CE loss (Early Overfitting)
Figure 3: Training with EMD Loss
Figure 4: Effects of Batch and modelling approach
