LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
Yi-Pei Chen, KuanChao Chu, Hideki Nakayama
TL;DR
This paper examines the reliability of LLM-based dialogue evaluation, where scores are highly sensitive to prompt structure and the judgments themselves are inherently subjective. It systematically compares six prompt configurations that vary whether the explanation precedes or follows the score, testing them across four GPT-family models on 25 dialogue sets scored on a scale of 1 to 10. The results show that a reason-first prompt generally yields higher and more consistent scores, and that this behavior is shaped by autoregressive generation, in which the final score is conditioned on the reasons produced before it. Removing task-specific rules from the prompt reduces these differences, further underscoring prompt sensitivity. These findings offer concrete guidance for prompt design in subjective evaluation tasks, with the aim of improving the reliability and interpretability of LLM-based dialogue scoring in practical applications.
Abstract
This research investigates the effect of prompt design on dialogue evaluation using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for dialogue evaluation remains challenging due to model sensitivity and the subjectivity of dialogue assessments. We experimented with different prompt structures, varying the order of the output instructions and the inclusion of explanatory reasons. We found that the order in which reasons and scores are presented significantly influences LLMs' scoring, with a "reason-first" approach yielding more comprehensive evaluations. This insight is crucial for enhancing the accuracy and consistency of LLM-based evaluations.
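To make the output-order manipulation concrete, the following minimal sketch builds a score-first and a reason-first variant of the same evaluation prompt. Everything in it is an assumption for illustration: the prompt wording, the JSON output format, and the names `OUTPUT_ORDERS`, `build_prompt`, and `parse_score` are hypothetical and are not taken from the paper's actual prompts.

```python
import json

# Hypothetical output orders; the two variants differ only in which key comes first.
OUTPUT_ORDERS = {
    "score_first": ["score", "reasons"],
    "reason_first": ["reasons", "score"],
}

# Illustrative field descriptions (assumed wording, not the paper's).
FIELD_DESCRIPTIONS = {
    "score": "an integer from 1 to 10",
    "reasons": "a short explanation of the score",
}


def build_prompt(dialogue: str, order: str) -> str:
    """Return an evaluation prompt whose output schema lists fields in the given order."""
    fields = OUTPUT_ORDERS[order]
    schema = ", ".join(f'"{name}": <{FIELD_DESCRIPTIONS[name]}>' for name in fields)
    return (
        "Evaluate the following dialogue on a scale of 1 to 10.\n"
        f"Respond with a JSON object whose keys appear in this order: {{{schema}}}\n\n"
        f"Dialogue:\n{dialogue}"
    )


def parse_score(response: str) -> int:
    """Extract and range-check the score from a JSON response string."""
    score = int(json.loads(response)["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score
```

Because an autoregressive model generates the keys in the requested order, the reason-first variant lets the score be conditioned on the generated reasons, whereas the score-first variant commits to a score before any explanation exists.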
