From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications
Yongqiang Ma, Lizhi Qing, Jiawei Liu, Yangyang Kang, Yue Zhang, Wei Lu, Xiaozhong Liu, Qikai Cheng
TL;DR
This work reframes LLM evaluation from model-centric scores to human-centered assessment by introducing Revision Distance ($D_{Revision}$), which counts the revision edits a user-like LLM performs to refine a draft toward an ideal text. The setup simulates human editing with an editor LLM ($LLM_{User}$) revising the output of a writer LLM ($LLM_{gen}$); the metric yields self-explanatory feedback in JSON and supports both reference-based and reference-free settings. Across easy-writing and challenging academic-writing tasks, $D_{Revision}$ aligns with established metrics yet discriminates between texts more finely, and in the reference-free setting it shows substantial alignment with human judgments (about 76%). The approach gives developers and end-users practical, transparent insights by highlighting revision-type patterns and enabling targeted improvements. Its reliance on GPT-4 raises cost considerations, and dynamic revision weighting is left to future work.
Abstract
Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed for LLM development, yield numerical scores that ignore the user experience. Our study therefore shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," uses LLMs to suggest revision edits that mimic the human writing process, and it is computed by counting the revision edits the LLMs generate. Because the generated edits carry their own details, the metric provides a self-explanatory, human-understandable evaluation result that goes beyond a context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, BERTScore, and GPT-score) but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, on challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Our metric also holds significant potential for scenarios lacking reference texts.
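The counting step described above can be illustrated with a minimal sketch. It assumes the editor LLM returns its suggestions as a JSON list with one object per revision edit; the field names `type` and `reason`, the sample edits, and the helper `revision_distance` are hypothetical and are not taken from the paper, whose actual output schema may differ.

```python
import json
from collections import Counter
from typing import Tuple


def revision_distance(editor_output: str) -> Tuple[int, Counter]:
    """Count revision edits in a JSON list produced by an editor LLM.

    Assumed (hypothetical) schema: a JSON array of objects, each with a
    "type" field naming the kind of edit and a "reason" field explaining it.
    """
    edits = json.loads(editor_output)
    if not isinstance(edits, list):
        raise ValueError("expected a JSON list of revision edits")
    # The distance is the number of suggested edits; the per-type tally
    # is the kind of detail that makes the result self-explanatory.
    by_type = Counter(edit.get("type", "unknown") for edit in edits)
    return len(edits), by_type


# Hypothetical editor output for one draft (illustrative only).
SAMPLE_OUTPUT = """
[
  {"type": "add", "reason": "Mention the dataset used in the experiments."},
  {"type": "modify", "reason": "State the main result more precisely."},
  {"type": "add", "reason": "Cite the baseline method."}
]
"""

distance, by_type = revision_distance(SAMPLE_OUTPUT)
print(distance)        # 3
print(dict(by_type))   # {'add': 2, 'modify': 1}
```

Under this sketch, a lower count means the draft needs fewer edits to reach the target text, and the per-type tally indicates where a draft falls short.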
