Large Language Models Are Active Critics in NLG Evaluation

Shuying Xu, Junjie Hu, Ming Jiang

TL;DR

This work targets the inflexibility of traditional LLM-based NLG evaluation that relies on fixed task descriptions and criteria. It introduces Active-Critic, a two-stage framework where an LLM first self-infers the target NLG task $T$ and its evaluation criteria from a small set of human-labeled examples, then dynamically optimizes prompts to produce human-aligned scores with detailed justifications. Across four diverse NLG tasks and multiple backbones, Active-Critic achieves higher correlation with human judgments than strong baselines, even with as few as $5$ labeled examples, and its ablations show task inference as a major contributor to performance. The approach yields interpretable, criterion-specific explanations and demonstrates generalization to unseen data, reducing manual prompt engineering while improving the reliability of NLG evaluation in practice.

Abstract

The conventional paradigm of using large language models (LLMs) for natural language generation (NLG) evaluation relies on pre-defined task definitions and evaluation criteria, positioning LLMs as "passive critics" that strictly follow developer-provided guidelines. However, human evaluators often apply implicit criteria, and their expectations in practice can vary widely based on specific end-user needs. Consequently, these rigid evaluation methods struggle to adapt to diverse scenarios without extensive prompt customization. To address this, we introduce Active-Critic, a novel LLM-based evaluator that transforms LLMs into "active critics" capable of adapting to diverse NLG tasks using limited example data. Active-Critic consists of two stages: (1) self-inferring the target NLG task and relevant evaluation criteria, and (2) dynamically optimizing prompts to produce human-aligned scores along with detailed justifications. Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria, enabling it to achieve superior alignment with human judgments across multiple tasks.
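The two stages described in the abstract can be pictured as two prompt-building steps. The sketch below is illustrative only and is not the authors' implementation: the prompt wording, the `Score:`/`Reason:` reply format, and the names `infer_task`, `score_output`, and `stub_llm` are all assumptions, and the prompt-optimization step of stage 2 is left out entirely. A stub stands in for the LLM so the flow can be run without any model.

```python
import re
from typing import Callable, List, Tuple

# (source text, generated output, human score); a hypothetical record layout.
Example = Tuple[str, str, float]
LLM = Callable[[str], str]


def format_examples(examples: List[Example]) -> str:
    """Render human-scored examples as plain text for inclusion in a prompt."""
    return "\n\n".join(
        f"Source: {src}\nOutput: {out}\nHuman score: {score}"
        for src, out, score in examples
    )


def infer_task(llm: LLM, examples: List[Example]) -> str:
    """Stage 1 (sketch): ask the model to describe the task and its criteria."""
    prompt = (
        "These are human-scored outputs from an unnamed text generation task.\n\n"
        f"{format_examples(examples)}\n\n"
        "Describe the task and list the evaluation criteria the scores imply."
    )
    return llm(prompt)


def score_output(llm: LLM, task_description: str, examples: List[Example],
                 source: str, output: str) -> Tuple[float, str]:
    """Stage 2 (sketch, no prompt optimization): score one output with a reason."""
    prompt = (
        f"{task_description}\n\n"
        f"Scored examples:\n{format_examples(examples)}\n\n"
        f"Source: {source}\nOutput: {output}\n"
        "Reply as 'Score: <number>' followed by 'Reason: <text>'."
    )
    reply = llm(prompt)
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError(f"no score found in reply: {reply!r}")
    # If the reply has no 'Reason:' marker, the whole reply is kept as the reason.
    reason = reply.split("Reason:", 1)[-1].strip()
    return float(match.group(1)), reason


def stub_llm(prompt: str) -> str:
    """Canned replies standing in for a real model (assumption for the demo)."""
    if "Describe the task" in prompt:
        return "Task: summarization. Criteria: coherence, faithfulness."
    return "Score: 4\nReason: Faithful but slightly verbose."


labeled = [("A long article.", "A short summary.", 4.0)]  # made-up example
task = infer_task(stub_llm, labeled)
score, reason = score_output(stub_llm, task, labeled, "Another article.", "Its summary.")
```

The inferred task description is passed into the scoring prompt in place of a hand-written guideline, which is the contrast the abstract draws with "passive" evaluators.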

Paper Structure

This paper contains 42 sections, 3 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Overview of Active-Critic, including two stages: (1) task inference, where the LLM is instructed to derive the target NLG evaluation task description and relevant criteria from data samples, and (2) scoring alignment, allowing the LLM to generate multi-criteria and overall quality scores along with accompanying explanations.
  • Figure 2: Average correlation between Orca2-based Active-Critic and human judgments with varying label sizes. Results for each correlation coefficient are provided in Appendix \ref{sec:appendix-trainingsize}.
  • Figure 3: Impact of prompt optimization on scoring and mini-batch iterations on task inference (Kendall-Tau %). See Appendix \ref{sec:appendix-optimization} for Pearson and Spearman results.
  • Figure 4: Results of Active-Critic's dependence on human-scored data by Pearson, Spearman, and Kendall-Tau, respectively.
  • Figure 5: Effectiveness of Optimization. We report the Pearson ($\gamma$) correlation coefficient for our two optimal experimental variants: AC-Coarse and AC-Fine.
  • ...and 4 more figures
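
The figure captions above report agreement with human judgments as Pearson, Spearman, and Kendall-Tau correlations. As a rough illustration of what those three numbers measure, here is a minimal pure-Python sketch. The score lists are made up for the example and are not results from the paper, and the tau-b variant of Kendall's coefficient is an assumption, since the captions do not say which variant is used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )


# Hypothetical scores for five generated texts.
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1.5, 1.0, 3.5, 3.0, 5.0]

rho = spearman(human_scores, evaluator_scores)        # 0.8
tau = kendall_tau_b(human_scores, evaluator_scores)   # 0.6
r = pearson(human_scores, evaluator_scores)
```

In this toy case the evaluator swaps two adjacent pairs, so the rank-based coefficients drop below 1 even though the overall trend matches the human ordering.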