How to Evaluate Speech Translation with Source-Aware Neural MT Metrics

Mauro Cettolo, Marco Gaido, Matteo Negri, Sara Papi, Luisa Bentivogli

TL;DR

This work addresses the evaluation of speech-to-text translation (ST) with source-aware neural MT metrics when gold source transcripts are unavailable. It investigates two textual proxies for the source audio, automatic speech recognition (ASR) transcripts and back-translations (BT) of the reference, and introduces a two-stage cross-lingual re-segmentation pipeline (XL-Segmenter followed by XLR-Segmenter) to align synthetic sources with reference translations. Across two large benchmarks (MuST-C and Europarl-ST) and 79 language pairs, the study shows that ASR-based sources yield higher correlation with gold scores when $\mathrm{WER} \le 20\%$, while BT provides a cost-effective alternative otherwise; both proxies enable reliable use of source-aware metrics in ST evaluation. The findings offer practical guidelines and robust methods for principled ST evaluation in real-world settings, facilitating more accurate benchmarking and comparability across systems and languages.

Abstract

Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
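
The 20% word-error-rate guideline stated above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an estimate of the ASR system's WER is available, for instance from a held-out set that does have transcripts. The function names `word_error_rate` and `choose_source_proxy` are hypothetical, and the handling of a WER of exactly 20% is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def choose_source_proxy(asr_wer: float, threshold: float = 0.20) -> str:
    """Pick the synthetic source for a source-aware metric.

    Hypothetical helper: returns "asr" when the estimated WER is below the
    threshold and "bt" (back-translation) otherwise. Treating a WER equal
    to the threshold as "bt" is an assumption of this sketch.
    """
    return "asr" if asr_wer < threshold else "bt"


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.3f}, proxy = {choose_source_proxy(wer)}")
```

The sketch only covers the selection step; producing the ASR transcript or the back-translation, and re-segmenting it against the references, are separate stages that it does not model.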

Paper Structure

This paper contains 29 sections, 8 figures, 20 tables, 1 algorithm.

Figures (8)

  • Figure 1: Scheme of XL-Segmenter and XLR-Segmenter.
  • Figure 2: The plane of these scatter charts is defined by the WER of the ASR source and the MetricX score of the BT source. For all possible comparisons between the MetricX correlation obtained with the ASR source and that obtained with the BT source, computed on all language pairs of the two corpora and for all ST systems, the two charts show where in that plane the cases fall in which the ASR (left) or the BT (right) was the preferable source for computing MetricX. Biased ASR MetricXs, i.e., those of ST systems that are somehow involved in generating the ASR source of the metric, are excluded. The total number of points is 1672: 1315 on the left (ASR wins, 78.6%) and 357 on the right (BT wins, 21.4%). A random 1% change was applied to all values to avoid overlapping points and make all of them visible.
  • Figure 3: For all possible comparisons between the MetricX correlation obtained with the ASR source and that obtained with the BT source, computed on all language pairs of the two corpora and for all ST systems, the histograms illustrate the distribution of cases in which the standard MetricX shows a higher correlation with MetricX using either the ASR or the BT as source input. Biased ASR MetricXs, i.e., those of ST systems that are somehow involved in generating the ASR source of the metric, are excluded. The left chart shows this distribution as a function of transcription quality (WER), the right chart as a function of (back-)translation quality (MetricX).
  • Figure 4: The plane of these scatter charts is defined by the WER of the ASR source and the MetricX score of the BT source. For all possible comparisons between the MetricX correlation obtained with the ASR source and that obtained with the BT source, computed on all language pairs of the two corpora and for all ST systems, the two charts show where in that plane the cases fall in which the ASR (left) or the BT (right) was the preferable source for computing MetricX. Biased ASR MetricXs, i.e., those of ST systems that are somehow involved in generating the ASR source of the metric, are excluded. The total number of points is 3440: 2104 on the left (ASR wins, 61.2%) and 1336 on the right (BT wins, 38.8%). A random 1% change was applied to all values to avoid overlapping points and make all of them visible.
  • Figure 5: For all possible comparisons between the MetricX correlation obtained with the ASR source and that obtained with the BT source, computed on all language pairs of the two corpora and for all ST systems, these histograms illustrate the distribution of cases in which the standard MetricX shows a higher correlation with MetricX using either the ASR or the BT as source input. Biased ASR MetricXs, i.e., those of ST systems that are somehow involved in generating the ASR source of the metric, are excluded. The left chart shows this distribution as a function of transcription quality (WER), the right chart as a function of (back-)translation quality (MetricX).
  • ...and 3 more figures