A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis

Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami

TL;DR

This work addresses the problem of evaluating open-ended LLM generation without ground truth or human/LLM judges by introducing a judge-free benchmark grounded in the distributional hypothesis. It constructs 50 questions, each with a reference answer set derived from multiple Japanese LLMs; the answers pass through rule-based and frequency-based filters and are then narrowed to 1,000 representative responses per question. Evaluation uses three metrics (Fluency, Truthfulness, and Helpfulness), each with an explicit n-gram-based formulation and length discounting, and the final score is their average. The benchmark correlates strongly with GPT-4o judgments and aligns well with existing Japanese benchmarks, offering a scalable, resource-efficient alternative for assessing open-ended generation capabilities. The work also discusses limitations and possible extensions to broader, non-Q&A generation tasks.

Abstract

Evaluating the open-ended text generation of large language models (LLMs) is challenging because there is no clear ground truth and because human or LLM-based assessments are costly. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 questions, each paired with a reference answer set, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.
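To make the idea of scoring with n-gram statistics concrete, the following is a minimal sketch of how an answer could be compared against a reference answer set without a judge. It is not the paper's Fluency, Truthfulness, or Helpfulness formulation, which this section does not give. The function names (`char_ngrams`, `build_reference_counts`, `ngram_support`), the use of character trigrams, and the simple "fraction of supported n-grams" score are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` (empty if too short)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_reference_counts(references, n):
    """Count, for each n-gram, how many reference answers contain it."""
    counts = Counter()
    for ref in references:
        # Use a set so each reference contributes at most once per n-gram.
        counts.update(set(char_ngrams(ref, n)))
    return counts


def ngram_support(answer, references, n=3):
    """Hypothetical score: share of the answer's n-grams seen in any reference.

    This is an illustrative stand-in, not the metric defined in the paper.
    """
    grams = char_ngrams(answer, n)
    if not grams:
        return 0.0
    counts = build_reference_counts(references, n)
    supported = sum(1 for g in grams if counts[g] > 0)
    return supported / len(grams)


if __name__ == "__main__":
    refs = ["the cat sat", "the cat ran"]
    print(ngram_support("the cat", refs))
    print(ngram_support("xyz", refs))
```

A score of this kind needs only counting over strings, which is consistent with the abstract's point that a judge-free approach requires far fewer computational resources than running an LLM as a judge.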

Paper Structure

This paper contains 26 sections, 7 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Evaluation outline
  • Figure 2: Relationships between Fluency, Truthfulness, and Helpfulness
  • Figure 3: Benchmark score comparison between using one of three LLMs to construct a reference answer set
  • Figure 4: Comparison between our benchmark and LLM-as-a-judge
  • Figure 5: Comparison between our benchmark and Nejumi LLM Leaderboard 3
  • ...and 1 more figure