Assessing Human Editing Effort on LLM-Generated Texts via Compression-Based Edit Distance

Nicolas Devatine, Louis Abraham

TL;DR

This work tackles the challenge of quantifying human editing effort on texts produced by large language models, arguing that traditional similarity metrics fail to capture complex edits. It introduces a compression-based edit distance grounded in the LZ77 algorithm, defined as $d(S \rightarrow T) = LZ(S|T) - LZ(S)$, and implements a linear-time approximation leveraging LZ factorization. The authors provide a high-quality dataset of accounting Q&A edits (both human and synthetic) and evaluate on the IWSLT 2019 benchmark, showing that the compression distance correlates more strongly with actual edit time and keystrokes than common metrics. Their results also reveal that LLMs exhibit speed-aware editing behavior consistent with the proposed metric, and the approach offers scalable, interpretable measurement with practical implications for model evaluation and human-AI collaboration.

Abstract

Assessing the extent of human edits on texts generated by Large Language Models (LLMs) is crucial to understanding human-AI interactions and improving the quality of automated text generation systems. Existing edit distance metrics, such as Levenshtein, BLEU, ROUGE, and TER, often fail to accurately measure the effort required for post-editing, especially when edits involve substantial modifications, such as block operations. In this paper, we introduce a novel compression-based edit distance metric grounded in the Lempel-Ziv-77 algorithm, designed to quantify the amount of post-editing applied to LLM-generated texts. Our method leverages the properties of text compression to measure the informational difference between the original and edited texts. Through experiments on real-world datasets of human edits, we demonstrate that our proposed metric is highly correlated with actual edit time and effort. We also show that LLMs exhibit an implicit understanding of editing speed that aligns well with our metric. Furthermore, we compare our metric with existing ones, highlighting its advantages in capturing complex edits with linear computational efficiency. Our code and data are available at: https://github.com/NDV-tiime/CompressionDistance
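
The definition $d(S \rightarrow T) = LZ(S|T) - LZ(S)$ can be illustrated with a minimal sketch. It rests on two assumptions that this summary does not spell out: that $LZ(\cdot)$ counts the phrases of a greedy LZ77-style factorization, and that $S|T$ denotes $S$ followed by $T$. The function names are hypothetical, and the brute-force match search below is only for illustration; it is not the authors' linear-time implementation.

```python
def lz_factor_count(text: str) -> int:
    """Count phrases in a greedy LZ77-style factorization of `text`.

    Assumption: each phrase is either the longest match with a substring
    starting at an earlier position (overlap allowed) or, if no match
    exists, a single literal character.
    """
    count = 0
    i = 0
    n = len(text)
    while i < n:
        longest = 0
        # Brute-force search over every earlier start position.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        # Advance by the match length, or by one literal character.
        i += max(longest, 1)
        count += 1
    return count


def compression_distance(source: str, target: str) -> int:
    """d(S -> T) = LZ(S|T) - LZ(S), reading S|T as concatenation (assumption)."""
    return lz_factor_count(source + target) - lz_factor_count(source)


if __name__ == "__main__":
    original = "abcdefgh"
    # Swapping two blocks costs only two phrases: each block is copied from S.
    print(compression_distance(original, "efghabcd"))  # 2
    # Entirely new text costs one phrase per unseen character.
    print(compression_distance(original, "ijklmnop"))  # 8
```

Under these assumptions, a block swap is cheap because each moved block is encoded as a single back-reference into the original text. This matches the abstract's point that block operations are poorly captured by character-level edit distances.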

Paper Structure

This paper contains 22 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of our dataset construction process. We sampled $200$ questions (and associated expert knowledge) from a Q&A knowledge base. First, each question is answered by an LLM without being provided the expert knowledge. Then, these answers are edited by either a human or an LLM with the expert knowledge provided, resulting in a final post-edited answer. Three scenarios are considered when editing is done by an LLM (normal, similar, and fast). For human edits, we measured the edit times (in seconds).
  • Figure 2: Length distribution of initial LLM answers and distribution of the compression-based edit distances for the different editing scenarios in the synthetic dataset.
  • Figure 3: Comparison of compression distances on our synthetic dataset between the normal editing scenario and the similar (left) and fast (right) editing scenarios. Each subplot shows scatter points and a fitted linear regression line (in red). The blue line $x=y$ is shown for reference.
  • Figure 4: Scatter plots with linear regression fits and confidence intervals for various metrics against measured edit times on our human post-edited dataset when concatenating the expert knowledge text to the initial LLM output.