An Automatic Quality Metric for Evaluating Simultaneous Interpretation

Mana Makinae; Katsuhito Sudoh; Masaru Yamada; Satoshi Nakamura

TL;DR

This work addresses the evaluation of simultaneous interpretation (SI), where the latency–quality trade-off pushes interpreters to keep the target word order synchronized with the source. It introduces two cross-lingual metrics based on rank correlation. The Synchro metric takes cross-lingual token alignments from mBERT and computes Spearman's $\rho$ over them, with two heuristics to discard unreliable alignments: an alignment threshold $\theta$ and function-word filtering. The Combined metric uses improved subword alignments (Awesome Align + BERTScore), computes $\rho$ for word order, and multiplies it by Content Words Coverage $= \frac{n}{N}$ to capture content preservation. Experiments on NAIST-SIC-Aligned and JNPC indicate that SI tends to synchronize word order with the source for longer segments, and that the Combined metric reflects human judgments in several cases by jointly considering synchronization and content coverage, though not in all cases because of omissions and summarization. The study provides a practical automatic evaluation framework for SI and simultaneous machine translation (SiMT) that highlights the importance of word-order synchronization in latency-conscious translation and offers guidance for designing FIFO-like strategies and future SI quality metrics. Key contributions include (i) a principled, alignment-driven Synchro metric based on $\rho$ with reliability heuristics, (ii) a robust Combined metric integrating token alignment and content coverage, and (iii) empirical insights linking word-order synchronization to human judgments on long sentences.
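The two scores summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the token alignments are already available as hypothetical `(source_position, target_position, similarity)` triples, it omits function-word filtering, and it interprets $n$ as the number of covered content words and $N$ as the total number of content words, which the summary does not spell out. The function names and the zero fallback for degenerate inputs are likewise assumptions made for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant ranking as uncorrelated
    return cov / sqrt(var_x * var_y)


def synchro_score(alignments, theta):
    """Rank correlation of source and target positions over alignments
    whose similarity is at least theta (hypothetical triple format)."""
    kept = [(s, t) for s, t, sim in alignments if sim >= theta]
    if len(kept) < 2:
        return 0.0  # assumption: too few reliable alignments to correlate
    return spearman_rho([s for s, _ in kept], [t for _, t in kept])


def combined_score(rho, n_covered, n_content):
    """Word-order correlation scaled by Content Words Coverage n / N."""
    return rho * n_covered / n_content


# Toy data: three reliable monotone alignments and one unreliable one.
alignments = [(0, 0, 0.9), (1, 1, 0.8), (2, 2, 0.95), (3, 0, 0.2)]
rho = synchro_score(alignments, theta=0.5)
combined = combined_score(rho, n_covered=3, n_content=4)
```

In this toy case the low-similarity alignment is dropped by the threshold, the remaining pairs are perfectly monotone, and the coverage factor then lowers the combined score because one of the four content words is not covered.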

Abstract

Simultaneous interpretation (SI), the translation of one language into another in real time, begins before the original speech has finished. Its evaluation must consider both latency and quality. This trade-off is especially challenging for language pairs with distant word orders, such as English and Japanese. To handle this word order gap, interpreters maintain the word order of the source language as much as possible so that they can keep up with the original speech, minimizing latency while maintaining quality, whereas in translation reordering is used to keep the target language fluent. Outputs synchronized with the source language are therefore desirable in real SI situations, and such synchronization is key to further progress in computational SI and simultaneous machine translation (SiMT). In this work, we propose an automatic evaluation metric for SI and SiMT that focuses on word order synchronization. Our metric is based on rank correlation coefficients and leverages cross-lingual pre-trained language models. Our experimental results on NAIST-SIC-Aligned and JNPC show the effectiveness of our metrics in measuring word order synchronization between the source and target languages.

Paper Structure (26 sections, 1 figure, 21 tables)

This paper contains 26 sections, 1 figure, 21 tables.

Introduction
Related Work
Word Order Synchronization for Delay Reduction
Evaluation of Simultaneous Interpretation
Automatic MT Evaluation
Latency Evaluation
Prerequisites
Token Alignment in BERTScore
Word Order Synchronization as Rank Correlation
Proposed Method
Synchro Metric
Combined Metric
Experiments
Comparison between SI and Offline Translation by Synchro Metric
Setup
...and 11 more sections

Figures (1)

Figure 1: Proposed metrics scores with varying the alignment threshold.
