ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
TL;DR
ExpertLongBench introduces a multi-domain, expert-level benchmark with 11 long-form tasks (1050 samples) that reflect real-world workflows and require outputs that can exceed 5,000 tokens. It pairs each task with a rubric designed or validated by domain experts and with checklist-mapped references, and proposes CLEAR, a grounded, checklist-based evaluation framework that maps model and reference outputs to per-item criteria. In experiments with 13 LLMs, current models struggle on end-to-end expert tasks (the top performer, Gemini-2.5-Pro, reaches only 33.4 F1) even though their content often addresses the required checklist aspects, highlighting a gap between surface conformity and correctness. The paper shows that open-weight models can substitute for proprietary judges in checklist extraction and comparison, enabling scalable, reproducible evaluation, and it analyzes task difficulty and skill decomposition to guide future benchmarking and model development. Overall, ExpertLongBench and CLEAR offer a rigorous, low-cost path toward evaluating and improving expert-level long-form generation in real-world workflows.
Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing 11 tasks from 9 domains that reflect realistic expert workflows and applications. Beyond question answering, the application-driven tasks in ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and strict adherence to domain-specific requirements. Notably, each task in ExpertLongBench includes a rubric, designed or validated by domain experts, to specify task requirements and guide output evaluation. Furthermore, we propose CLEAR, an evaluation framework that supports accurate evaluation of long-form model outputs in our benchmark. To achieve fine-grained, expert-aligned evaluation, CLEAR derives checklists from both model outputs and references by extracting information corresponding to items in the task-specific rubric. Checklist items of model outputs are then compared with the corresponding items of reference outputs to assess their correctness, enabling grounded evaluation. We benchmark 13 popular large language models (LLMs) and analyze components in CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro achieving only a 33.4 F1 score, require significant improvement for expert-level tasks; (2) models can generate content corresponding to the required aspects, but that content is often far from correct; and (3) accurate checklist extraction and comparison in CLEAR can be achieved by open-weight models for more scalable, reproducible, and low-cost usage.
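The checklist comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the F1 score is computed, so the sketch assumes that precision is taken over rubric items the model output addresses and recall over rubric items present in the reference. The names `checklist_f1` and `exact_match`, and the rubric items in the example, are hypothetical. In CLEAR, both checklist extraction and item comparison are performed by LLMs; here the checklists are given directly and a simple string match stands in for the comparison step.

```python
from typing import Callable, Dict, Optional

# Rubric item name -> extracted content, or None if the output does not cover it.
Checklist = Dict[str, Optional[str]]


def checklist_f1(
    model_items: Checklist,
    reference_items: Checklist,
    matches: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Score a model checklist against a reference checklist (assumed scoring rule)."""
    # Rubric items the model output addresses at all.
    predicted = {k for k, v in model_items.items() if v}
    # Rubric items the reference output actually contains.
    expected = {k for k, v in reference_items.items() if v}
    # Items present on both sides and judged consistent with the reference.
    correct = {
        k for k in predicted & expected
        if matches(model_items[k], reference_items[k])
    }
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def exact_match(model_text: str, reference_text: str) -> bool:
    # Placeholder for an LLM judge; real comparison needs semantic matching.
    return model_text.strip().lower() == reference_text.strip().lower()


# Hypothetical rubric items for a legal summarization task.
model_checklist = {"parties": "Smith v. Jones", "remedy": "injunction", "court": None}
reference_checklist = {"parties": "Smith v. Jones", "remedy": "damages", "court": "N.D. Cal."}

scores = checklist_f1(model_checklist, reference_checklist, exact_match)
print({name: round(value, 2) for name, value in scores.items()})
```

In this example the model addresses two of the three reference items but only one of them correctly, which mirrors finding (2): covering the required aspects is not the same as getting them right.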
