Better Late Than Never: Meta-Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation

Peter Polák, Sara Papi, Luisa Bentivogli, Ondřej Bojar

TL;DR

This work presents the first comprehensive meta-evaluation of latency metrics across language pairs and systems, and introduces YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio.

Abstract

Simultaneous speech-to-text translation systems must balance translation quality with latency. Although quality evaluation is well established, latency measurement remains a challenge. Existing metrics produce inconsistent results, especially in short-form settings with artificial presegmentation. We present the first comprehensive meta-evaluation of latency metrics across language pairs and systems. We uncover a structural bias in current metrics related to segmentation. We introduce YAAL (Yet Another Average Lagging) for a more accurate short-form evaluation and LongYAAL for unsegmented audio. We propose SoftSegmenter, a resegmentation tool based on soft word-level alignment. We show that YAAL and LongYAAL, together with SoftSegmenter, outperform popular latency metrics, enabling more reliable assessments of short- and long-form simultaneous speech translation systems. We implement all artifacts within the OmniSTEval toolkit: https://github.com/pe-trik/OmniSTEval.

Paper Structure

This paper contains 39 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 2: Translations and emission times of a model. Words in the same column were emitted at once. The emission time of the last five words (the tail words "gemeinnützige Organisation namens Robin Hood.") depends on the segmentation. Oracle Segmentation: the segmentation is known beforehand; once the model consumes the entire sentence, it is asked to finish the translation without additional delay. Simultaneous Segmentation: the evaluation uses an online segmenter that needs extra time (the red area, approximately 0.5 s) to decide when the sentence ends.
  • Figure 3: Translations and emission times of a model. Green translations are emitted simultaneously, while red translations are emitted after the end-of-segment signal (vertical dashed line). The Normal Simultaneous Policy emits the translations uniformly and has only a small fraction of tail words. The Degenerate Simultaneous Policy quickly emits a few words at the beginning and defers the majority of words until after the segment ends, effectively performing offline translation.
  • Figure 4: Each point represents a pair of systems, plotting the difference in their true latency (x-axis) against the difference in the automatic metric (y-axis). The reported Pearson and Kendall rank correlations are only indicative, as each language pair has a different scale.
  • Figure 5: Proposed degenerate simultaneous policy test (green area). Empty markers represent automatically filtered systems. Dotted markers are systems submitted by the teams whose systems were reported in \ref{tab:offenders}. Each point represents the actual and expected proportion of words translated simultaneously, as observed on the short-form systems. Blue represents En$\rightarrow$De, orange En$\rightarrow$Ja, and green En$\rightarrow$Zh systems. Diamonds represent tst-COMMON, circles IWSLT 2022, and squares IWSLT 2023 test set systems.
  • Figure 6: Short-form accuracies after removing degenerate simultaneous policy systems, based on the difference between two systems in true latency (left) and in YAAL values (right). $N$ (dashed/dotted lines) indicates the number of pairs in each group. The colored strips show the 95% confidence interval.
  • ...and 1 more figure