Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework

Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li

TL;DR

ARJudge addresses robustness gaps in open-source evaluators through adaptive criterion generation and multi-faceted evaluation that combines text-based and code-driven analyses. The framework comprises a fine-tuned Analyzer, trained on a Composite Analysis Corpus to produce diverse analyses, and a tuning-free Refiner that combines those analyses into a final judgment. Empirical results across multiple benchmarks show that ARJudge surpasses existing fine-tuned evaluators and that including code-driven analyses yields notable gains, with improvements of up to 11.1% in instruction-following evaluation and stronger generalization on unseen samples. The work highlights the value of multi-faceted, code-aware evaluation for scalable, reliable LLM assessment and points to future tool integrations.

Abstract

Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, which reduces their adaptability to unseen instructions and makes them unstable when evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. To train the Analyzer, we construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. They further demonstrate the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
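
To make the idea of a code-driven analysis more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes two hypothetical constraints (a word limit and an exact bullet count) and shows how small scripts could verify quantitative and structural requirements that are hard to judge reliably from text alone. All names (`CheckResult`, `check_max_words`, `check_bullet_count`, `prefer`) are invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    constraint: str  # human-readable description of the constraint
    passed: bool     # whether the response satisfies it
    detail: str      # what the script actually measured


def check_max_words(response: str, limit: int) -> CheckResult:
    """Quantitative constraint: whitespace-separated word count must not exceed `limit`."""
    n = len(response.split())
    return CheckResult(f"at most {limit} words", n <= limit, f"found {n} words")


def check_bullet_count(response: str, expected: int) -> CheckResult:
    """Structural constraint: exactly `expected` lines start with a bullet marker."""
    bullets = [ln for ln in response.splitlines() if re.match(r"\s*[-*•]\s+", ln)]
    return CheckResult(
        f"exactly {expected} bullet points",
        len(bullets) == expected,
        f"found {len(bullets)} bullet points",
    )


def prefer(r1: str, r2: str, checks: List[Callable[[str], CheckResult]]) -> str:
    """Compare two responses by the number of checks each passes (a simplification)."""
    s1 = sum(c(r1).passed for c in checks)
    s2 = sum(c(r2).passed for c in checks)
    if s1 == s2:
        return "tie"
    return "R1" if s1 > s2 else "R2"


if __name__ == "__main__":
    r1 = "- alpha\n- beta\n- gamma"
    r2 = "- alpha\n- beta"
    checks = [
        lambda r: check_bullet_count(r, 3),
        lambda r: check_max_words(r, 10),
    ]
    for c in checks:
        print(c(r1), c(r2))
    print("preferred:", prefer(r1, r2, checks))
```

In the framework described above, the outcome of such scripts would be only one input among several: the Refiner combines code-driven results with text-based analyses before producing a judgment, whereas this sketch simply counts passed checks.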

Paper Structure

This paper contains 33 sections, 1 equation, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparison of previous fine-tuned evaluators and our framework. Left: a previous fine-tuned evaluator. Right: our ARJudge. The Analyzer adaptively defines evaluation criteria and conducts multi-faceted analyses in various forms, e.g., text or code. The Refiner combines all preceding analyses and produces the final evaluation.
  • Figure 2: Overview of the corpus construction. "R1" and "R2" denote two candidate responses with a preference annotation. "Sample Responses" are newly sampled responses that we use as references to generate evaluation questions and code scripts. Step (1) produces two types of evaluation questions. Steps (2) and (3) develop the corresponding text-based and code-driven analyses.
  • Figure 3: Results on the consistency between code-driven evaluation and IFEval evaluation. "Loose" and "Strict" are two judgment criteria in IFEval.
  • Figure 4: Evaluation results with an increasing number of analyses. The right panel shows the results on four subsets of LLMBar.
  • Figure 5: An example of evaluation generated by ARJudge.
  • ...and 8 more figures