Finetuning LLMs for Comparative Assessment Tasks

Vatsal Raina; Adian Liusie; Mark Gales

TL;DR

This work proposes a framework for finetuning LLMs for comparative assessment so that the model's output aligns with the target distribution of comparative probabilities. The approach improves on state-of-the-art performance and remains strong when only an efficient subset of comparisons is used.

Abstract

Automated assessment in natural language generation is a challenging task. Instruction-tuned large language models (LLMs) have shown promise in reference-free evaluation, particularly through comparative assessment. However, the quadratic computational complexity of pairwise comparisons limits the scalability of this approach. To address this, efficient comparative assessment has been explored by applying comparative strategies to zero-shot LLM probabilities. We propose a framework for finetuning LLMs for comparative assessment to align the model's output with the target distribution of comparative probabilities. By training on soft probabilities, our approach improves on state-of-the-art performance and maintains high performance when only an efficient subset of comparisons is used.
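
The two ideas in the abstract, the quadratic cost of pairwise comparison and training on soft rather than hard targets, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each item has a scalar ground-truth score and that the soft target for a pair is a sigmoid of the score difference scaled by a temperature-like parameter `gamma` (the figure list mentions a $\gamma$ in a sigmoid, but the exact form used in the paper is not given here). The function names `num_pairwise_comparisons`, `soft_target`, `soft_cross_entropy`, and `build_training_pairs` are hypothetical.

```python
import math
from itertools import combinations


def num_pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n(n-1)/2, i.e. O(n^2)."""
    return n * (n - 1) // 2


def soft_target(score_i: float, score_j: float, gamma: float = 1.0) -> float:
    """ASSUMED soft label: probability that item i ranks above item j,
    modelled as sigmoid(gamma * (score_i - score_j))."""
    return 1.0 / (1.0 + math.exp(-gamma * (score_i - score_j)))


def soft_cross_entropy(p_target: float, p_model: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a soft target and a model probability."""
    p_model = min(max(p_model, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p_target * math.log(p_model)
             + (1.0 - p_target) * math.log(1.0 - p_model))


def build_training_pairs(scores: list[float], gamma: float = 1.0):
    """Return (i, j, soft_target) for every unordered pair of items."""
    return [
        (i, j, soft_target(scores[i], scores[j], gamma))
        for i, j in combinations(range(len(scores)), 2)
    ]


if __name__ == "__main__":
    scores = [0.1, 0.5, 0.9]  # illustrative values only, not data from the paper
    for i, j, p in build_training_pairs(scores, gamma=2.0):
        print(f"pair ({i}, {j}): target P(i > j) = {p:.3f}")
    print("comparisons for 100 items:", num_pairwise_comparisons(100))
```

Under these assumptions, a larger `gamma` pushes the targets toward 0 or 1 (closer to hard labels), while a smaller `gamma` keeps them near 0.5. The loss is minimized when the model's comparative probability matches the soft target rather than when it saturates.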

Paper Structure (19 sections, 4 equations, 5 figures, 6 tables)

Introduction
Related work
LLM comparative assessment
Scoring methods
Finetuning Systems
Experiments
Data
Models
Results
Conclusions
Limitations
Ethics statement
Impact of $\gamma$
Relationship between PoE-BT and Thurstone-Mosteller
Prompts
...and 4 more sections

Figures (5)

Figure 1: USMLE response time estimation: Efficient comparisons with Llama-3.1.
Figure 2: CMCQRD difficulty estimation: Efficient comparisons with Llama-3.1.
Figure 3: Impact of distribution of training probabilities based on choice of $\gamma$ in sigmoid.
Figure 4: Linear mapping between $\sigma$ and $\Phi$.
Figure 5: Relationship of scores (from zero-shot GPT-4o mini) using PoE-BT and PoE-TM for response time estimation.
