CA*: Addressing Evaluation Pitfalls in Computation-Aware Latency for Simultaneous Speech Translation

Xi Xu, Wenda Xu, Siqi Ouyang, Lei Li

TL;DR

Current latency metrics are widely believed to yield unrealistically high measurements in unsegmented streaming settings. This paper traces the problem to a misconception underlying existing latency evaluation, shows that it affects not only streaming but also segment-level latency evaluation across different metrics, and proposes a modification that correctly measures computation-aware latency for SimulST systems.

Abstract

Simultaneous speech translation (SimulST) systems must balance translation quality with response time, making latency measurement crucial for evaluating their real-world performance. However, there has been a longstanding belief that current metrics yield unrealistically high latency measurements in unsegmented streaming settings. In this paper, we investigate this phenomenon, revealing its root cause in a fundamental misconception underlying existing latency evaluation approaches. We demonstrate that this issue affects not only streaming but also segment-level latency evaluation across different metrics. Furthermore, we propose a modification to correctly measure computation-aware latency for SimulST systems, addressing the limitations present in existing metrics.

Paper Structure

This paper contains 7 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Computation-aware latency metrics tend to produce unrealistically high scores as speech duration increases. We segmented the original tst-COMMON dataset into 25s speech segments, then duplicated and concatenated them to create 50s, 75s, and 100s speech durations. The system used for evaluation is a wait-k-stride-n model, where n=3 and k=4, with each speech segment spanning 250 ms.
  • Figure 2: Computation-unaware and computation-aware delays $d_i$ with the corresponding oracle delay $d^*$, where the intersection represents $\tau'(|\mathbf{X}|)$. (For illustration, we plot only one of every three tokens emitted by the stride-3 SimulST system.) After conversion, AL_CA considers only the first 46 outputs against the oracle, a result of the unreliable calculation of elapsed computation time, while AL_CU considers all outputs until the last speech segment.
  • Figure 3: In practice, a SimulST system alternates between reading and writing actions while receiving speech input in a continuous stream. The existing approach implicitly assumes that generating text and processing streaming speech happen sequentially, which can make the accumulated delay unreliable.
  • Figure 4: Previous generation time exceeding the current segment $x_j$ introduces additional delay for generating tokens in the current frame. For the given example, if generating each token takes 1 second, after reading 1 second of speech, the system takes 2 seconds to generate 'ein Mississippi', resulting in $\beta_j$ being 1 second. The delay for $I_i$ would be the sum of the previous speech duration, $\beta_j$, and $I_i$. If there is no buffer, later tokens are not affected.
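
The accounting issue described in the captions of Figures 3 and 4 can be illustrated with a small toy model. The sketch below is not the paper's metric and not any toolkit's implementation: the function names, the one-write-step-per-segment simplification, and the fixed segment duration are all assumptions made for illustration. It contrasts a sequential accounting, where every second of computation is stacked on top of the speech read so far, with an overlap-aware accounting, where computation runs while speech keeps arriving and only a backlog (called `beta` here, after $\beta_j$ in Figure 4) postpones later outputs.

```python
from typing import List, Tuple


def sequential_delays(seg_dur: float, compute_times: List[float]) -> List[float]:
    """Delay of each write step when compute time is simply added to speech time.

    Hypothetical model of the sequential assumption: the delay at step j is the
    speech read so far plus ALL computation time spent so far.
    """
    delays = []
    total_compute = 0.0
    for j, c in enumerate(compute_times, start=1):
        total_compute += c
        delays.append(j * seg_dur + total_compute)
    return delays


def overlapped_delays(
    seg_dur: float, compute_times: List[float]
) -> Tuple[List[float], List[float]]:
    """Delay of each write step when speech keeps streaming during computation.

    Segment j becomes available at time j * seg_dur. Processing starts once the
    previous step has finished, so the extra wait is
    beta_j = max(0, previous_finish - arrival_j).
    Returns (delays, betas).
    """
    delays, betas = [], []
    prev_finish = 0.0
    for j, c in enumerate(compute_times, start=1):
        arrival = j * seg_dur
        beta = max(0.0, prev_finish - arrival)  # backlog from earlier generation
        finish = arrival + beta + c
        delays.append(finish)
        betas.append(beta)
        prev_finish = finish
    return delays, betas
```

With 1 s segments and 0.25 s of computation per step, `sequential_delays(1.0, [0.25] * 4)` gives 1.25, 2.5, 3.75, 5.0, so the reported lag behind the speech grows with input length, while `overlapped_delays` gives 1.25, 2.25, 3.25, 4.25 with every `beta` equal to zero. For a case like the Figure 4 example (1 s of speech, then 2 s of generation), `overlapped_delays(1.0, [2.0, 1.0])` yields a backlog of 1 s for the second step.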