Non-Linear Scoring Model for Translation Quality Evaluation

Serge Gladkoff, Lifeng Han, Katerina Gasova

TL;DR

The paper shows that linear error scaling in MQM-based translation quality evaluation misaligns with human judgment as sample size changes. It proposes a two-parameter non-linear model $E(x)=a\ln(1+b x)$ anchored by two tolerance points and calibrated via a one-dimensional root-finding step, grounded in psychophysics (Weber–Fechner) and Cognitive Load Theory. Empirical data from three enterprise environments demonstrate that acceptable error counts grow sublinearly with text length, with the log model closely matching expert judgments ($R^2$ ≈ 0.95) and outperforming linear fits. The approach yields improved interpretability, fairness, and inter-rater reliability, and extends naturally to AI-generated content and document-level evaluation within a unified quality scale suitable for CAT/LQA tools and automation.

Abstract

Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model $E(x)=a\ln(1+bx)$, with $a, b > 0$, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within $\pm20\%$ relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
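The calibration step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the two tolerance points are given as (sample size in words, acceptable penalty) pairs and that the root-finding is done by plain bisection on $b$; the abstract does not say which root-finder the authors use. The function names `tolerance` and `calibrate` and the two example points are hypothetical, chosen only for illustration. Dividing the two anchor conditions $E(x_1)=e_1$ and $E(x_2)=e_2$ eliminates $a$ and leaves one equation in $b$, namely $\ln(1+bx_2)/\ln(1+bx_1)=e_2/e_1$, whose left side falls monotonically from $x_2/x_1$ to $1$ as $b$ grows.

```python
import math


def tolerance(x, a, b):
    """Log-model tolerance E(x) = a * ln(1 + b * x)."""
    return a * math.log1p(b * x)


def calibrate(x1, e1, x2, e2, rel_tol=1e-12):
    """Fit a, b so that E(x1) = e1 and E(x2) = e2 (hypothetical helper).

    Uses bisection on b; this is an assumed choice of root-finder.
    """
    if not (0 < x1 < x2 and 0 < e1 < e2):
        raise ValueError("need 0 < x1 < x2 and 0 < e1 < e2")
    target = e2 / e1
    # A solution with b > 0 exists only if tolerance grows sublinearly.
    if target >= x2 / x1:
        raise ValueError("tolerance must grow sublinearly between the points")

    def g(b):
        # Decreasing in b: tends to x2/x1 - target as b -> 0, 1 - target as b -> inf.
        return math.log1p(b * x2) / math.log1p(b * x1) - target

    lo, hi = 1e-12 / x2, 1.0 / x2
    while g(hi) > 0:          # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    a = e1 / math.log1p(b * x1)   # back-substitute into the first anchor
    return a, b


# Illustrative anchor points (not taken from the paper):
# 5 penalty points tolerated at 1000 words, 10 at 3000 words.
a, b = calibrate(1000, 5, 3000, 10)
print(f"a = {a:.4f}, b = {b:.6f}")
print(f"E(250) = {tolerance(250, a, b):.3f}, E(10000) = {tolerance(10000, a, b):.3f}")
```

For these illustrative points the equation can also be solved by hand: $(1+1000b)^2 = 1+3000b$ gives $b = 0.001$ and $a = 5/\ln 2$, which is a convenient check on the numerical result.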

Paper Structure

This paper contains 36 sections, 36 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Client 1 raw questionnaire responses. X‑axis: pages ($250$ words/page); Y‑axis: maximum allowed number of minor errors.
  • Figure 2: Client 2 raw questionnaire responses. X‑axis: pages ($250$ words/page); Y‑axis: maximum allowed number of minor errors.
  • Figure 3: Client 3 raw questionnaire responses. X‑axis: pages ($250$ words/page); Y‑axis: maximum allowed number of errors (two series shown, minor and major errors).
  • Figure 4: Three interpolations. The spreadsheet curve $c+k\ln x$ is shown for comparison only; calibration uses $E(x)=a\ln(1+bx)$, which passes through the origin and is anchorable to tolerance points.
  • Figure 5: Linear model (blue) vs. human perception (PFT, red). Yellow bands indicate the $\pm20\%$ fidelity zone around the linear anchor; outside these bands the linear rule over‑ or under‑estimates true tolerance.
  • ...and 3 more figures
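
The $\pm20\%$ fidelity zone described in the Figure 5 caption can be explored numerically with a rough sketch. Several things here are assumptions, not the paper's definitions: the linear rule is taken to be the straight line through the origin and the anchor $(x_{\mathrm{ref}}, E(x_{\mathrm{ref}}))$, the relative error is measured against the log model, and the zone edges are found by bisection, not by the explicit formula the paper refers to. The parameter values and all function names are hypothetical.

```python
import math


def log_tolerance(x, a, b):
    """Log-model tolerance E(x) = a * ln(1 + b * x)."""
    return a * math.log1p(b * x)


def linear_tolerance(x, x_ref, a, b):
    """Assumed linear rule: line through the origin and (x_ref, E(x_ref))."""
    return log_tolerance(x_ref, a, b) * x / x_ref


def ratio(x, x_ref, a, b):
    """Linear tolerance divided by log tolerance; increasing in x, 1 at x_ref."""
    return linear_tolerance(x, x_ref, a, b) / log_tolerance(x, a, b)


def solve_increasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target, f increasing, f(lo) < target <= f(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def fidelity_zone(x_ref, a, b, band=0.20):
    """Sample sizes where the linear rule stays within +/- band of the log model."""
    def f(x):
        return ratio(x, x_ref, a, b)

    # As x -> 0 the ratio tends to ln(1 + b*x_ref) / (b*x_ref).
    limit_at_zero = math.log1p(b * x_ref) / (b * x_ref)
    if limit_at_zero >= 1 - band:
        x_lo = 0.0                      # linear rule never drops below the band
    else:
        x_lo = solve_increasing(f, 1 - band, 1e-9 * x_ref, x_ref)

    hi = x_ref
    while f(hi) < 1 + band:             # expand until the upper edge is bracketed
        hi *= 2.0
    x_hi = solve_increasing(f, 1 + band, x_ref, hi)
    return x_lo, x_hi


# Hypothetical parameters, for illustration only.
a, b, x_ref = 5 / math.log(2), 0.001, 1000
x_lo, x_hi = fidelity_zone(x_ref, a, b)
print(f"linear rule within 20% of the log model for {x_lo:.0f} to {x_hi:.0f} words")
```

Under these assumptions the ratio is below 1 for samples shorter than the anchor, so the linear rule allows fewer errors than the log model there (over-penalizing), and above 1 for longer samples (under-penalizing), which matches the bias described in the abstract.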