A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization
KuanChao Chu, Yi-Pei Chen, Hideki Nakayama
TL;DR
The paper investigates how prompt design, specifically the order of output instructions and the inclusion of reasons, shapes LLM-based evaluation of dialogue quality. Through systematic experiments across six prompt configurations and multiple models, it finds that placing reasons before the score generally improves scoring and reduces sensitivity to prompt variations, with rule-based prompt elements amplifying this effect. It then explores prompt optimization methods (GRIPS and OPRO) using SummEval coherence data, showing GRIPS can meaningfully improve alignment with human judgments while OPRO's gains depend on iteration and data availability. The findings provide actionable guidance for building more reliable, subjectivity-tolerant LLM evaluators for text generation and dialogue tasks, highlighting the value of reason-first prompts and data-driven prompt optimization.
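The output-ordering idea can be made concrete with a minimal sketch. Everything below is illustrative and assumed rather than taken from the paper: the task wording, the 1 to 10 scale, the JSON output format, and the names `build_prompt` and `parse_response` are all hypothetical. The only thing the two variants change is whether the model is asked to emit the reasons or the score first. Because an autoregressive model generates left to right, a score that comes after the reasons can condition on them, which is one plausible reading of why ordering matters.

```python
import json
from typing import Tuple

# Hypothetical task description; not the wording used in the paper.
TASK = "Rate the overall quality of the dialogue below on a scale from 1 to 10."

# Two output-instruction variants that differ only in key order (assumed format).
OUTPUT_INSTRUCTIONS = {
    "score_first": 'Answer in JSON, keys in this order: {"score": <int>, "reasons": "<text>"}',
    "reasons_first": 'Answer in JSON, keys in this order: {"reasons": "<text>", "score": <int>}',
}


def build_prompt(dialogue: str, order: str) -> str:
    """Assemble an evaluation prompt with the requested output ordering."""
    if order not in OUTPUT_INSTRUCTIONS:
        raise ValueError(f"unknown order: {order!r}")
    return "\n\n".join([TASK, "Dialogue:\n" + dialogue, OUTPUT_INSTRUCTIONS[order]])


def parse_response(raw: str) -> Tuple[int, str]:
    """Extract (score, reasons) from a JSON reply, regardless of key order."""
    data = json.loads(raw)
    return int(data["score"]), str(data["reasons"])
```

Sending the same dialogue through both variants and comparing the parsed scores is one simple way to observe the sensitivity to output order described above.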
Abstract
This research investigates prompt designs for evaluating generated text with large language models (LLMs). While LLMs are increasingly used to score various inputs, creating effective prompts for open-ended text evaluation remains challenging because models are sensitive to prompt wording and the evaluation of generated text is inherently subjective. We experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order in which reasons and scores are presented significantly influences LLMs' scoring, and that the effect varies with the level of rule understanding conveyed in the prompt. Additional prompt optimization may improve scoring alignment if sufficient data is available. These insights are important for improving the accuracy and consistency of LLM-based evaluations.
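As a rough illustration of what "scoring alignment" could mean in practice, the sketch below computes a rank correlation between LLM scores and human scores for the same items. The choice of Spearman correlation is an assumption made for this example; the text above does not name the metric used. The helper names `average_ranks`, `pearson`, and `spearman` are likewise hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(llm_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(llm_scores) != len(human_scores) or len(llm_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(llm_scores), average_ranks(human_scores))
```

Under this assumption, a prompt variant whose scores yield a higher `spearman(llm_scores, human_scores)` on held-out annotated data would count as better aligned with human judgments.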
