LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation

Yi-Pei Chen, KuanChao Chu, Hideki Nakayama

TL;DR

This paper examines the reliability of LLM-based dialogue evaluation, noting that scores are highly sensitive to prompt structure and to the inherent subjectivity of dialogue assessment. It systematically compares six prompt configurations that vary whether the explanation precedes or follows the score, testing them on four GPT-family models with 25 dialogue sets scored on a scale of $1$ to $10$. The results show that a reason-first prompt generally yields higher and more consistent scores, and that model behavior is shaped by autoregressive generation: reasons produced first condition the score that follows. Removing task-specific rules from the prompt further reduces the differences between configurations, underscoring prompt sensitivity. These findings offer concrete guidance for prompt design in subjective evaluation tasks, with the aim of improving the reliability and interpretability of LLM-based dialogue scoring in practice.

Abstract

This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and subjectivity in dialogue assessments. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
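
As a rough illustration of what altering the sequence of output instructions can look like, the sketch below builds two prompts that differ only in whether the reasons or the score are requested first, and parses a score from a reply. The instruction wording, the `build_prompt` and `parse_score` helpers, and the `Score: <n>` reply format are all assumptions made for this example; they are not the prompts or code used in the paper.

```python
import re
from typing import Optional

# Hypothetical output instructions; the paper's exact wording is not reproduced here.
OUTPUT_INSTRUCTIONS = {
    "score_first": "First give a score from 1 to 10, then explain your reasons.",
    "reason_first": "First explain your reasons, then give a score from 1 to 10.",
}


def build_prompt(dialogue: str, order: str) -> str:
    """Combine a task line, the dialogue, and an order-specific output instruction."""
    if order not in OUTPUT_INSTRUCTIONS:
        raise ValueError(f"unknown order: {order!r}")
    return (
        "Evaluate the quality of the following dialogue.\n\n"
        f"Dialogue:\n{dialogue}\n\n"
        f"{OUTPUT_INSTRUCTIONS[order]}\n"
        "Write the score on its own line as 'Score: <n>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the integer after 'Score:'; return None if absent or outside 1..10."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

Because the two prompts are identical apart from the output instruction, any systematic gap between their scores can be attributed to output order rather than to other wording differences.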

Paper Structure

This paper contains 9 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Score distribution across 50 trials for each model and output instruction configuration for a dialogue set.
  • Figure 2: Score distribution across 50 trials for each model and output instruction configuration for a dialogue set, with the 'special rules' omitted from the prompt.
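
A score distribution of the kind the figure captions describe could be tabulated along the following lines. This is a minimal sketch, assuming each trial yields one integer score between 1 and 10; the `summarize_scores` helper is hypothetical and is not taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_scores(scores):
    """Return the mean, population standard deviation, and per-score counts.

    `scores` is assumed to hold one integer in 1..10 per trial.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    counts = Counter(scores)
    return {
        "mean": mean(scores),
        "std": pstdev(scores),
        # Include every score from 1 to 10 so empty bins show up as zero.
        "histogram": {s: counts.get(s, 0) for s in range(1, 11)},
    }
```

Running this once per model and output instruction configuration gives comparable summaries: a shift in the mean reflects higher or lower scoring, and a smaller standard deviation reflects more consistent scoring.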