Large Language Models Are Active Critics in NLG Evaluation
Shuying Xu, Junjie Hu, Ming Jiang
TL;DR
This work targets the inflexibility of traditional LLM-based NLG evaluation, which relies on fixed task descriptions and criteria. It introduces Active-Critic, a two-stage framework in which an LLM first self-infers the target NLG task $T$ and its evaluation criteria from a small set of human-labeled examples, then dynamically optimizes prompts to produce human-aligned scores with detailed justifications. Across four diverse NLG tasks and multiple backbones, Active-Critic achieves higher correlation with human judgments than strong baselines, even with as few as $5$ labeled examples, and ablations show that task inference is a major contributor to performance. The approach yields interpretable, criterion-specific explanations and generalizes to unseen data, reducing manual prompt engineering while improving the reliability of NLG evaluation in practice.
Abstract
The conventional paradigm of using large language models (LLMs) for natural language generation (NLG) evaluation relies on pre-defined task definitions and evaluation criteria, positioning LLMs as "passive critics" that strictly follow developer-provided guidelines. However, human evaluators often apply implicit criteria, and their expectations in practice can vary widely based on specific end-user needs. Consequently, these rigid evaluation methods struggle to adapt to diverse scenarios without extensive prompt customization. To address this, we introduce Active-Critic, a novel LLM-based evaluator that transforms LLMs into "active critics" capable of adapting to diverse NLG tasks using limited example data. Active-Critic consists of two stages: (1) self-inferring the target NLG task and relevant evaluation criteria, and (2) dynamically optimizing prompts to produce human-aligned scores along with detailed justifications. Our experiments show that Active-Critic can generate nuanced, context-aware evaluation criteria, enabling it to achieve superior alignment with human judgments across multiple tasks.
