DOCUEVAL: An LLM-based AI Engineering Tool for Building Customisable Document Evaluation Workflows
Hao Zhang, Qinghua Lu, Liming Zhu
TL;DR
The paper addresses the challenges of applying LLMs to document evaluation (customisation, accuracy, scalability, privacy, and governance) by introducing DOCUEVAL, an AI engineering tool for building customisable document evaluation workflows, organised into six architectural layers. It supports theory-grounded reviewer roles, configurable evaluation criteria, and a choice of reasoning strategies, with traceability provided through logs, source attribution, and versioned configurations. The authors demonstrate its practicality on a real-world academic peer-review use case, showing how tailored workflows, retrieval-augmented generation (RAG) context, and human-AI collaboration yield interpretable, auditable evaluations. The work offers a blueprint for deploying trustworthy, scalable AI-assisted document evaluation in professional settings and outlines future directions toward evaluation-driven learning for continual improvement.
Abstract
Foundation models, such as large language models (LLMs), have the potential to streamline evaluation workflows and improve their performance. However, practical adoption faces challenges in customisability, accuracy, and scalability. In this paper, we present DOCUEVAL, an AI engineering tool for building customisable DOCUment EVALuation workflows. DOCUEVAL supports advanced document processing and customisable workflow design, which together allow users to define theory-grounded reviewer roles, specify evaluation criteria, experiment with different reasoning strategies, and choose the assessment style. To ensure traceability, DOCUEVAL provides comprehensive logging of every run, along with source attribution and configuration management, allowing systematic comparison of results across alternative setups. By integrating these capabilities, DOCUEVAL directly addresses core software engineering challenges, including how to determine whether evaluators are "good enough" for deployment and how to empirically compare different evaluation strategies. We demonstrate the usefulness of DOCUEVAL through a real-world academic peer review case, showing how it enables both the engineering of evaluators and scalable, reliable document evaluation.
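To make the configuration and traceability ideas above concrete, the following is a minimal sketch in Python, not DOCUEVAL's actual implementation. All names (`WorkflowConfig`, `RunLog`, `version_id`, `compare`) and the example field values are hypothetical and chosen only for illustration. The sketch assumes a workflow can be described by four settings (reviewer role, criteria, reasoning strategy, assessment style), that a configuration version can be derived from a hash of those settings, and that each run is logged together with its scores and attributed sources.

```python
from dataclasses import dataclass, asdict, field
import hashlib
import json


@dataclass(frozen=True)
class WorkflowConfig:
    """Hypothetical description of one evaluation workflow setup."""
    reviewer_role: str        # e.g. a theory-grounded reviewer persona
    criteria: tuple           # evaluation criteria, in a fixed order
    reasoning_strategy: str   # illustrative label, not a DOCUEVAL option name
    assessment_style: str     # e.g. numeric scores or qualitative comments

    def version_id(self) -> str:
        # Serialise with sorted keys so identical settings always
        # produce the same identifier.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


@dataclass
class RunLog:
    """Hypothetical in-memory log of evaluation runs."""
    entries: list = field(default_factory=list)

    def record(self, config, document_id, scores, sources):
        # Store the configuration version alongside the result so that
        # every score can be traced back to the setup that produced it.
        entry = {
            "config_version": config.version_id(),
            "document_id": document_id,
            "scores": dict(scores),
            "sources": list(sources),
        }
        self.entries.append(entry)
        return entry

    def compare(self, document_id):
        # Map each configuration version to its scores for one document.
        # If a version was run more than once, the latest run wins.
        return {
            e["config_version"]: e["scores"]
            for e in self.entries
            if e["document_id"] == document_id
        }
```

Under these assumptions, recording the same document with two configurations that differ only in `reasoning_strategy` and then calling `compare` on that document returns one score set per configuration version, which is the kind of side-by-side comparison across alternative setups that the abstract describes.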
