SpeechQE: Estimating the Quality of Direct Speech Translation

HyoJung Han, Kevin Duh, Marine Carpuat

TL;DR

The authors argue that quality estimation for speech translation should be studied as a problem separate from text quality estimation, and that end-to-end approaches are better suited to estimating the quality of direct speech translation than cascaded systems built on quality estimation systems designed for text.

Abstract

Recent advances in automatic quality estimation for machine translation have exclusively focused on written language, leaving the speech modality underexplored. In this work, we formulate the task of quality estimation for speech translation (SpeechQE), construct a benchmark, and evaluate a family of systems based on cascaded and end-to-end architectures. In this process, we introduce a novel end-to-end system leveraging a pre-trained text LLM. Results suggest that end-to-end approaches are better suited to estimating the quality of direct speech translation than using quality estimation systems designed for text in cascaded systems. More broadly, we argue that quality estimation of speech translation needs to be studied as a separate problem from that of text, and release our data and models to guide further research in this space.

Paper Structure

This paper contains 37 sections, 4 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Quality Estimation for Speech Translation (SpeechQE) vs. Text Quality Estimation (text-QE).
  • Figure 2: Comparing cascaded and end-to-end approaches to Quality Estimation for Speech Translation (SpeechQE).
  • Figure 3: Prompt template for the SpeechQE (quality estimation for speech translation), ASR, ST, and SpeechESD (error span detection for ST) tasks.