Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning

Xiaorong Wang, Ting Yang, Zhu Zhang, Shuo Wang, Zihan Zhou, Liner Yang, Zhiyuan Liu, Maosong Sun

TL;DR

Monocle addresses the challenge of evaluating long-form generation by dividing the task into localized chunk assessments (local evaluation) and a subsequent global synthesis (global evaluation). It augments this with hybrid in-context learning that incorporates human annotations and explanations, and an uncertainty-based active learning strategy to selectively annotate informative samples. The ReliGen benchmark is introduced to meta-evaluate long-form assessment methods using paper-writing tasks, demonstrating Monocle's superior alignment with human judgments across multiple models and settings. The approach reduces annotation cost while delivering robust, reference-free evaluation suitable for practical, long-form generation scenarios.

Abstract

Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
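The divide-and-conquer idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the word-based chunking rule, and the toy judges are all hypothetical stand-ins, and in practice each judge would be an LLM call prompted with human-annotated demonstrations.

```python
from statistics import mean
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Assumption: fixed-size word windows; the paper may chunk differently.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def evaluate_long_text(
    text: str,
    local_judge: Callable[[str], float],
    global_judge: Callable[[List[float]], float],
    max_words: int = 200,
) -> float:
    """Score each chunk locally, then combine the local scores globally."""
    chunks = split_into_chunks(text, max_words)
    local_scores = [local_judge(chunk) for chunk in chunks]
    return global_judge(local_scores)


# Hypothetical stand-ins for the LLM-based judges.
def toy_local_judge(chunk: str) -> float:
    # Toy heuristic only: half a point per word, capped at 5.
    return min(5.0, len(chunk.split()) / 2)


def toy_global_judge(scores: List[float]) -> float:
    # Toy aggregation only: plain average of the local scores.
    return mean(scores) if scores else 0.0


if __name__ == "__main__":
    sample = "word " * 10
    print(evaluate_long_text(sample, toy_local_judge, toy_global_judge, max_words=4))
```

Passing the judges as callables keeps the control flow separate from the scoring model, so a real local or global judge could be swapped in without changing the pipeline.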

Paper Structure

This paper contains 43 sections, 8 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Illustration of the proposed hybrid local-global in-context evaluation framework. Using a divide-and-conquer approach, the evaluation is split into two stages. In the local evaluation stage, LLM-based local judges use demonstrations with human-assigned scores and model-generated explanations to assess the quality of each chunk. In the global evaluation stage, the global judge combines the local scores into a final assessment using score aggregation examples and corresponding global explanations.
  • Figure 2: Illustration of the proposed uncertainty-based active learning method.
  • Figure 3: Distribution of papers in ReliGen.
  • Figure 4: Visualization of the scores assigned by both Monocle and the HelloEval baseline, where both methods incorporate a small amount of human annotation to enhance the reliability of the evaluation results.
  • Figure 5: Length distribution of the reference papers and the model responses in ReliGen.
  • ...and 5 more figures
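
The uncertainty-based active learning step named in the Figure 2 caption can be sketched as follows. The text here does not define the uncertainty measure, so this example assumes one: the spread (population standard deviation) of scores obtained by querying the judge several times on the same sample. The function name and data layout are hypothetical.

```python
from statistics import pstdev
from typing import Dict, List


def select_for_annotation(
    sampled_scores: Dict[str, List[float]],
    budget: int,
) -> List[str]:
    """Return the IDs of the `budget` samples with the most uncertain scores.

    Assumption: uncertainty is the standard deviation of repeated judge
    scores for a sample; the paper's actual measure may differ.
    """
    uncertainty = {
        sample_id: pstdev(scores)
        for sample_id, scores in sampled_scores.items()
    }
    # Highest uncertainty first; ties are broken by sample ID for determinism.
    ranked = sorted(uncertainty, key=lambda sid: (-uncertainty[sid], sid))
    return ranked[:budget]


if __name__ == "__main__":
    scores = {
        "a": [3.0, 3.0, 3.0],  # judge always agrees with itself
        "b": [1.0, 5.0, 3.0],  # judge disagrees strongly
        "c": [2.0, 4.0, 3.0],  # judge disagrees mildly
    }
    print(select_for_annotation(scores, budget=2))
```

Samples on which the judge is already consistent are left unannotated, which is how a selection rule of this kind would reduce annotation cost.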