Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning
Xiaorong Wang, Ting Yang, Zhu Zhang, Shuo Wang, Zihan Zhou, Liner Yang, Zhiyuan Liu, Maosong Sun
TL;DR
Monocle evaluates long-form generation by splitting the task into chunk-level assessments (local evaluation) followed by a synthesis over the whole text (global evaluation). It augments both stages with hybrid in-context learning that incorporates human annotations and explanations, and it uses an uncertainty-based active learning strategy to select the most informative samples for annotation. The paper also introduces ReliGen, a benchmark for meta-evaluating long-form assessment methods on paper-writing tasks, on which Monocle aligns more closely with human judgments across multiple models and settings. The approach reduces annotation cost while providing robust, reference-free evaluation suited to practical long-form generation scenarios.
Abstract
Assessing the quality of long-form, model-generated text is challenging, even with advanced LLM-as-a-Judge methods, due to performance degradation as input length increases. To address this issue, we propose a divide-and-conquer approach, which breaks down the comprehensive evaluation task into a series of localized scoring tasks, followed by a final global assessment. This strategy allows for more granular and manageable evaluations, ensuring that each segment of the text is assessed in isolation for both coherence and quality, while also accounting for the overall structure and consistency of the entire piece. Moreover, we introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. By incorporating human-generated feedback directly into the evaluation process, this method allows the model to better align with human judgment. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation, thereby reducing annotation costs in practical scenarios. Experimental results show that the proposed evaluation framework outperforms several representative baselines, highlighting the effectiveness of our approach.
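The pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the chunking rule, the judge prompts, or the uncertainty measure, so every name below is hypothetical. The sketch assumes a fixed word-count chunker, judges passed in as plain callables (in practice these would be LLM calls with in-context human annotations), a mean as the default global aggregation, and uncertainty measured as the variance of repeated judge scores for the same sample.

```python
from statistics import mean, pvariance
from typing import Callable, Dict, List, Sequence

# Hypothetical judge type: maps a text chunk to a numeric quality score.
LocalJudge = Callable[[str], float]
# Hypothetical aggregator type: maps the local scores to one global score.
GlobalJudge = Callable[[Sequence[float]], float]


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words (assumed rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def local_global_score(
    text: str,
    local_judge: LocalJudge,
    global_judge: GlobalJudge = mean,
    max_words: int = 200,
) -> float:
    """Score each chunk on its own, then combine the local scores globally."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("cannot score empty text")
    local_scores = [local_judge(chunk) for chunk in chunks]
    return global_judge(local_scores)


def select_for_annotation(
    repeated_scores: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Pick the `budget` samples whose repeated judge scores disagree the most.

    Uncertainty is taken here to be the population variance of the repeated
    scores; this is an illustrative assumption, not the paper's definition.
    """
    ranked = sorted(
        repeated_scores,
        key=lambda sample_id: pvariance(repeated_scores[sample_id]),
        reverse=True,
    )
    return ranked[:budget]


if __name__ == "__main__":
    # Toy judge that scores a chunk by its word count, for demonstration only.
    toy_judge = lambda chunk: float(len(chunk.split()))
    print(local_global_score("a b c d e", toy_judge, max_words=2))
    print(select_for_annotation({"s1": [3, 3, 3], "s2": [1, 5, 3]}, budget=1))
```

In this sketch, samples on which the judge is least self-consistent are sent to human annotators first, and their annotations would then serve as in-context examples for later local and global judgments.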
