Learned-Rule-Augmented Large Language Model Evaluators
Jie Meng, Jin Mao
TL;DR
This paper tackles the generalization gap of LLM evaluators by introducing a rule-augmented framework that distills scoring rules from data using LLM-assisted MCTS and applies them through Chain-of-Rule (CoR) prompting or a reinforcement-learning-trained Rule-Augmented Evaluator (RuAE). The authors show that CoR improves performance across models and that RuAE achieves strong results on long-form, complex tasks such as ASAP and Relish, with large models excelling on SummEval. The approach yields interpretable, human-aligned rules and improves reasoning in evaluation scenarios spanning scoring, regression, ranking, and judging. The work highlights the practical potential of data-driven, rule-guided evaluators for broad NLG evaluation challenges, while noting computational costs and task-dependent limitations.
Abstract
Large language models (LLMs) are predominantly used as evaluators for natural language generation (NLG) tasks, but their application to broader evaluation scenarios remains limited. In this work, we explore the potential of LLMs as general evaluators across diverse tasks. Although LLM-based evaluators have made progress in different areas, existing methods struggle to generalize because they rely on costly, human-designed evaluation principles, which are often misaligned with both the annotated data and LLMs' understanding. To address these challenges, we propose a rule-augmented evaluation paradigm. First, we introduce a rule distillation method that automatically extracts scoring rules from data using an LLM-assisted Monte Carlo Tree Search (MCTS), alleviating scalability issues and improving alignment with data. Second, to enable LLMs to effectively apply the learned rules, we propose two strategies: (1) Chain-of-Rule (CoR), which guides the LLM to follow the distilled rules, and (2) training a rule-augmented LLM evaluator (RuAE) via reinforcement learning, further bridging the gap between the rules and LLMs' reasoning. Extensive experiments on diverse tasks demonstrate the effectiveness and generalizability of our approach across various evaluation scenarios.
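To make the Chain-of-Rule idea concrete, the following is a minimal sketch of how distilled rules might be placed into a prompt and how a score might be read back from the model's reply. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_cor_prompt` and `parse_score`, the prompt wording, the example rules, and the `Score:` output convention are all hypothetical, and no LLM call is made.

```python
from typing import List, Tuple


def build_cor_prompt(rules: List[str], task: str, text: str,
                     score_range: Tuple[int, int]) -> str:
    """Assemble a prompt asking a model to apply each rule in order, then score.

    The template is hypothetical; the abstract does not specify the wording.
    """
    low, high = score_range
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        f"Task: {task}\n"
        f"Scoring rules (apply them one at a time, in order):\n{numbered}\n\n"
        f"Text to evaluate:\n{text}\n\n"
        "For each rule, state whether the text satisfies it and why. "
        f"Then give a final score between {low} and {high} "
        "on the last line as 'Score: <number>'."
    )


def parse_score(response: str) -> float:
    """Return the number on the last 'Score:' line of a model response."""
    for line in reversed(response.strip().splitlines()):
        if line.strip().lower().startswith("score:"):
            return float(line.split(":", 1)[1].strip())
    raise ValueError("no 'Score:' line found in response")


# Hypothetical rules, standing in for rules distilled from annotated data.
example_rules = [
    "The essay states a clear position in the first paragraph.",
    "Each claim is supported by at least one piece of evidence.",
]
prompt = build_cor_prompt(example_rules, "essay scoring",
                          "Sample essay text.", (1, 6))
```

The prompt string would be sent to whichever LLM serves as the evaluator, and `parse_score` would be applied to its reply. Listing the rules in a fixed order and asking for a per-rule judgment before the final score is one simple way to encourage rule-by-rule reasoning.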
