A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization

KuanChao Chu, Yi-Pei Chen, Hideki Nakayama

TL;DR

The paper investigates how prompt design, specifically the order of output instructions and the inclusion of reasons, shapes LLM-based evaluation of dialogue quality. Through systematic experiments across six prompt configurations and multiple models, it finds that placing reasons before the score generally improves scoring and reduces sensitivity to prompt variations, with rule-based prompt elements amplifying this effect. It then explores prompt optimization methods (GRIPS and OPRO) using SummEval coherence data, showing GRIPS can meaningfully improve alignment with human judgments while OPRO's gains depend on iteration and data availability. The findings provide actionable guidance for building more reliable, subjectivity-tolerant LLM evaluators for text generation and dialogue tasks, highlighting the value of reason-first prompts and data-driven prompt optimization.

Abstract

This research investigates prompt designs for evaluating generated texts using large language models (LLMs). While LLMs are increasingly used for scoring various inputs, creating effective prompts for open-ended text evaluation remains challenging due to model sensitivity and the subjectivity inherent in evaluating text generation. Our study experimented with different prompt structures, altering the sequence of output instructions and including explanatory reasons. We found that the order of presenting reasons and scores significantly influences LLMs' scoring, with a different level of rule understanding in the prompt. Additional prompt optimization may enhance scoring alignment if sufficient data is available. These insights are crucial for improving the accuracy and consistency of LLM-based evaluations.
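
To make the "sequence of output instructions" concrete, the following is a minimal sketch of how a score-first and a reason-first output instruction might be generated and how a reply could be parsed. It is not the paper's actual prompt: the JSON output format, the 1 to 10 score range, and all names (`OUTPUT_ORDERS`, `build_output_instruction`, `parse_response`) are assumptions made for illustration only.

```python
import json
from typing import Tuple

# Hypothetical variants: only the order of the two output fields differs.
OUTPUT_ORDERS = {
    "score_first": ["score", "reason"],
    "reason_first": ["reason", "score"],
}

# Assumed field descriptions; the paper's wording and score scale may differ.
FIELD_HINTS = {
    "score": "an integer from 1 to 10",
    "reason": "a brief justification",
}


def build_output_instruction(order: str) -> str:
    """Return an output instruction listing the fields in the requested order."""
    fields = OUTPUT_ORDERS[order]
    lines = [f'  "{name}": <{FIELD_HINTS[name]}>' for name in fields]
    return "Respond with a JSON object of the form:\n{\n" + ",\n".join(lines) + "\n}"


def parse_response(text: str) -> Tuple[int, str]:
    """Extract (score, reason) from a JSON reply, regardless of key order."""
    data = json.loads(text)
    return int(data["score"]), str(data["reason"])
```

Under these assumptions, `build_output_instruction("reason_first")` asks the model to write its justification before committing to a number, while `"score_first"` asks for the number first. The rest of the prompt would stay identical across the two variants, so any difference in the score distribution can be attributed to the output order.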

Paper Structure (18 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Score distribution across 50 trials for each model and output instruction configuration for a dialogue set.
  • Figure 2: The prompt format for the LLM scorer in conversation evaluation. The special rules section is simplified for better readability.
  • Figure 3: Score distribution across 50 trials for each model and output instruction configuration for a dialogue set, with the 'special rules' omitted from the prompt.