From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

Yongqiang Ma; Lizhi Qing; Jiawei Liu; Yangyang Kang; Yue Zhang; Wei Lu; Xiaozhong Liu; Qikai Cheng

TL;DR

This work reframes LLM evaluation from model-centric scores to human-centered assessment by introducing Revision Distance ($D_{Revision}$), which counts the revision edits a user-like LLM performs to refine a draft toward an ideal text. By simulating human editing with an editor LLM ($LLM_{User}$) and a writer LLM ($LLM_{gen}$), the metric yields self-explained feedback in JSON and supports both reference-based and reference-free settings. Across easy-writing and challenging academic-writing tasks, $D_{Revision}$ aligns with established metrics yet offers finer discrimination, and it shows substantial alignment with human judgments in the absence of references (about 76%). The approach provides practical, transparent insights for developers and end-users, highlighting revision-type patterns and enabling targeted improvements, albeit with GPT-4 cost considerations and opportunities for dynamic revision weighting in future work.

Abstract

Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, designed primarily for LLM development, yield numerical scores that ignore the user experience. Our study therefore shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," uses LLMs to suggest revision edits that mimic the human writing process, and is computed by counting the revision edits the LLM generates. Because the generated edits carry their own details, the metric provides a self-explained, human-understandable evaluation result rather than only a context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, BERTScore, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Finally, our metric holds significant potential for scenarios lacking reference texts.
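The counting step described above can be illustrated with a minimal sketch. It assumes the editor LLM returns a JSON list of edit objects; only the `action_name` field is mentioned in the source, so the other field (`detail`), the action labels, and the helper name `revision_distance` are hypothetical and used purely for illustration. This is not the authors' implementation.

```python
import json
from collections import Counter
from typing import Tuple


def revision_distance(llm_output: str) -> Tuple[int, Counter]:
    """Count revision edits in an editor LLM's JSON output.

    Assumption: the output is a JSON list of objects, each with an
    "action_name" key. Any other keys are ignored here.
    """
    edits = json.loads(llm_output)
    if not isinstance(edits, list):
        raise ValueError("expected a JSON list of revision edits")
    # Tally edits by type so the score stays explainable.
    by_action = Counter(edit.get("action_name", "unknown") for edit in edits)
    # The distance itself is simply the number of suggested edits.
    return len(edits), by_action


# Hypothetical editor output; the action labels are invented for this example.
sample_output = """
[
  {"action_name": "simplify_background", "detail": "Shorten the opening."},
  {"action_name": "reorder_related_work", "detail": "Move prior work earlier."},
  {"action_name": "simplify_background", "detail": "Drop a redundant sentence."}
]
"""

distance, by_action = revision_distance(sample_output)
print(distance)
print(by_action.most_common())
```

A lower count suggests the draft needs fewer edits to reach the target text, and the per-action tally is what makes the score inspectable rather than a bare number.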

Paper Structure (18 sections, 1 equation, 4 figures, 5 tables)

Introduction
Related Work
Revision Distance
Results and Discussion
Evaluation for Reference-based Setting
Task and Dataset
Text Generation Models
Result Analysis
Evaluation for Reference-free Setting
Qualitative Analysis
Conclusion
Dataset for Reference-based Setting
Text Generation Models for Easy Writing Task
Text Generation Models for Challenge Writing Task
Example for Revision Action Item
...and 3 more sections

Figures (4)

Figure 1: Inspired by the classical edit distance metric, our "Revision Distance" $\mathbf{D}_{Revision}$ offers a more human-centered and nuanced metric for text evaluation. As illustrated, $\mathbf{D}_{Revision}(Draft, GroundTruth)$ provides a more transparent evaluation result, benefiting from the generated revision edit details.
Figure 2: The evaluation flow of "Revision Distance". We require the $LLM_{User}$ to produce results in JSON format with detailed information. In this work, we primarily use the action_name to analyze the revisions.
Figure 3: An example of content-based revision. The generated revision simplifies the background introduction in the AI-generated text.
Figure 4: In this case, the difference between the two texts is the order of the related work statements, which reflects the author's argumentation structure.
