Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study

Wenwen Xie, Gray Gwizdz, Dongji Feng

TL;DR

This work tackles the misweighting problem in LLM-based evaluation by introducing a two-step framework that embeds explicit error weighting into prompts. Through a case study on Databricks documentation, it demonstrates that weighting critical vs. supporting vs. trivial facts improves alignment with human judgments, achieving an average HAR gain of about 6.4% and a top HAR of 95.8% with Mixtral-8x7B Instruct. The study analyzes score distributions and statistics to reveal how model size and prompt design affect reliability and variability, highlighting trade-offs between accuracy and consistency. The findings suggest that carefully engineered prompts can significantly enhance the reliability of LLMs as evaluators in domain-specific NLG tasks, with practical implications for scalable, human-aligned automated evaluation.

Abstract

While Large Language Models (LLMs) have emerged as promising tools for evaluating Natural Language Generation (NLG) tasks, their effectiveness is limited by their inability to appropriately weigh the importance of different topics: they often overemphasize minor details while undervaluing critical information, which leads to misleading assessments. Our work proposes an efficient prompt design mechanism to address this specific limitation and provides a case study. Through strategic prompt engineering that incorporates explicit importance weighting mechanisms, we enhance the LLM-as-a-Judge's ability to prioritize relevant information effectively, as demonstrated by an average improvement of 6% in the Human Alignment Rate (HAR) metric.

Paper Structure

This paper contains 18 sections, 1 equation, 1 figure, 5 tables, 1 algorithm.

Figures (1)

  • Figure 1: One-Way ANOVA: Model Score Distributions.