Integrated Framework for LLM Evaluation with Answer Generation

Sujeong Lee, Hayoung Lee, Seongsoo Heo, Wonik Choi

TL;DR

SPEED is an active, multi-dimensional evaluation framework for LLMs that leverages domain-specific functional experts to diagnose hallucinations, toxicity, and lexical-context quality. It jointly generates reliable reference answers and analyzes candidate outputs through a three-stage process—diverse prompting, expert-backed feedback, and evaluation—yielding more interpretable and fair assessments than fixed benchmarks. Empirical results show that SPEED improves answer quality across multiple datasets and domains, with compact 8B-scale experts delivering competitive performance relative to larger evaluators. The framework’s modularity and reliance on self-refinement enhance adaptability, allowing substitution of experts and domain-specific customization to meet diverse practical evaluation needs.

Abstract

Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.

Paper Structure

This paper contains 33 sections, 3 figures, 17 tables.

Figures (3)

  • Figure 1: Example responses from two AI assistants (Assistant A: Qwen2.5-7B-instruct, Assistant B: Llama3-8B-instruct)
  • Figure 2: Schematic of SPEED framework
  • Figure 3: Overall pipeline of SPEED. In the diverse prompting stage, the domain model selects the optimal response from the various outputs generated through diverse prompts. In the feedback stage, TE and HE analyze the selected response and provide feedback, and the domain model incorporates this feedback to revise the response. In the evaluation stage, functional experts analyze an evaluation target against the final reference answer across three key dimensions: hallucination, toxicity, and contextual appropriateness.
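
The three stages described in the Figure 3 caption can be sketched in Python as a rough illustration of the control flow only. This is not the authors' implementation: the class `SpeedSketch`, its fields, and every callable signature are hypothetical names introduced here, and the models and experts are assumed to be plain functions supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signatures; none of these names come from the paper.
Generate = Callable[[str], str]                # prompt -> response
Select = Callable[[str, List[str]], str]       # (question, candidates) -> chosen response
Critique = Callable[[str, str], str]           # (question, response) -> feedback text
Revise = Callable[[str, str, List[str]], str]  # (question, response, feedback) -> revision
Diagnose = Callable[[str, str, str], str]      # (question, reference, target) -> diagnosis


@dataclass
class SpeedSketch:
    generate: Generate                      # domain model producing candidates
    select: Select                          # domain model choosing among candidates
    feedback_experts: Dict[str, Critique]   # e.g. keys "TE" and "HE"
    revise: Revise                          # domain model applying the feedback
    evaluation_experts: Dict[str, Diagnose] # one entry per evaluation dimension

    def build_reference(self, question: str, prompt_templates: List[str]) -> str:
        # Stage 1 (diverse prompting): one candidate per prompt template,
        # then the domain model picks the response it judges best.
        candidates = [
            self.generate(template.format(question=question))
            for template in prompt_templates
        ]
        best = self.select(question, candidates)
        # Stage 2 (feedback): each expert comments on the selected response,
        # and the domain model revises it into the final reference answer.
        feedback = [critique(question, best)
                    for critique in self.feedback_experts.values()]
        return self.revise(question, best, feedback)

    def evaluate(self, question: str, reference: str, target: str) -> Dict[str, str]:
        # Stage 3 (evaluation): each functional expert describes the target
        # response relative to the reference answer, one entry per dimension.
        return {
            name: expert(question, reference, target)
            for name, expert in self.evaluation_experts.items()
        }
```

Under these assumptions, a caller would build the reference answer once per question with `build_reference` and then call `evaluate` for each candidate output to be assessed. Keeping the experts in dictionaries mirrors the modularity claim in the TL;DR: an expert can be swapped by replacing a single entry.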