HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Mingxuan Li, Hanchen Li, Chenhao Tan

TL;DR

HypoEval introduces a hypothesis-guided evaluation framework that leverages a small set of human judgments to generate decomposed evaluation rubrics (hypotheses) from both data and literature. A hypothesis bank is curated and refined via an exploration-exploitation strategy, and a subset of hypotheses is selected to guide a multi-dimension scoring process using a checklist-like aggregation. Across summarization and story-generation tasks, HypoEval achieves state-of-the-art alignment with human judgments while needing far fewer labeled examples, and it demonstrates robustness to out-of-distribution data, prompt variations, and changes in evaluator models. The approach provides interpretable evaluation by breaking down subjective aspects into explicit dimensions that feed into an aggregate score, offering a scalable, tuning-free alternative to traditional LLM-based evaluators.
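The checklist-like aggregation described above can be illustrated with a minimal sketch. It assumes the overall score is an unweighted mean of the per-hypothesis scores, which this summary does not confirm. The names `aggregate_score` and `toy_scorer` are hypothetical, and the keyword-matching scorer is only a stand-in for an LLM call.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical type: a scorer maps (text, hypothesis) to a numeric score.
Scorer = Callable[[str, str], float]


def aggregate_score(text: str, hypotheses: List[str], score_fn: Scorer) -> Dict[str, object]:
    """Score the text on each hypothesis, then combine the scores.

    Assumption: the combination is an unweighted mean. The per-dimension
    scores are returned too, so the overall score stays interpretable.
    """
    per_dimension = {h: score_fn(text, h) for h in hypotheses}
    overall = mean(list(per_dimension.values()))
    return {"per_dimension": per_dimension, "overall": overall}


def toy_scorer(text: str, hypothesis: str) -> float:
    """Stand-in for an LLM judge: 5 if the hypothesis's last word occurs in the text, else 1."""
    keyword = hypothesis.split()[-1].lower()
    return 5.0 if keyword in text.lower() else 1.0


hypotheses = [
    "Good summaries mention the outcome",
    "Good summaries name the location",
]
result = aggregate_score("The outcome of the vote was announced.", hypotheses, toy_scorer)
print(result["per_dimension"])
print(result["overall"])  # 3.0 (mean of 5.0 and 1.0)
```

Returning the per-dimension scores alongside the aggregate mirrors the interpretability claim: a reader can see which hypothesis pulled the overall score up or down.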

Abstract

Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous LLM-as-a-judge frameworks fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then uses a checklist-like approach to combine the LLM's assigned scores on each decomposed dimension into overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and outperforming Llama-3.1-8B-Instruct fine-tuned on at least 3 times as many human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
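The two alignment measures named in the abstract can be computed as in the sketch below, which uses only the standard library. Spearman correlation is Pearson correlation applied to ranks (ties receive their average rank), so it measures agreement with human rankings, while Pearson measures linear agreement with the human scores themselves. The function names and the sample scores are illustrative assumptions, not data from the paper.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five generated texts (illustration only).
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [1.0, 2.5, 2.0, 4.5, 5.0]
print(round(spearman(human_scores, evaluator_scores), 3))  # 0.9
print(round(pearson(human_scores, evaluator_scores), 3))
```

In the example, the evaluator swaps the order of the second and third texts, which lowers the Spearman correlation from 1.0 to 0.9 even though the scores stay close in value.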

Paper Structure

This paper contains 24 sections, 5 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: A comparison between previous methods and HypoEval. We achieve high-alignment, explainable evaluation with only a few human labels per dataset.
  • Figure 2: Results of prompt robustness study comparing HypoEval with direct scoring, where each dot in the box plots refers to a specific prompt variation. HypoEval shows significantly stronger robustness to evaluation prompts on representative evaluation settings.
  • Figure 3: Illustration of the distribution of human evaluation scores of SummEval. The scores for the consistency and fluency aspects are highly skewed towards 5, which potentially leads to the decrease in performance of HypoEval on these aspects.