Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems

Marco Gaido, Sara Papi, Mauro Cettolo, Matteo Negri, Luisa Bentivogli

TL;DR

Streaming Speech-to-Text Translation (StreamST) requires translating concurrently with incoming speech under tight latency constraints, with two main paradigms: re-translation and incremental decoding. The authors present simulstream, an open-source framework enabling unified evaluation and live demonstration of StreamST systems, supporting long-form audio, token-level emission/deletion tracking, and an interactive web interface. The toolkit provides quality and latency metrics (BLEU, COMET, StreamLAAL) plus additional statistics such as normalized erasure (NE) and real-time factor (RTF), with a re-segmentation step via mweralign to align outputs to sentence-level references. Experiments on MuST-C across eight language pairs compare Canary and SeamlessM4T under multiple configurations, revealing that incremental decoding can offer favorable quality-latency trade-offs, while VAD-based wrappers reduce flicker and cost at some expense in quality. Overall, simulstream serves as a practical, extensible platform for benchmarking and demonstrating StreamST approaches.
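
To make the token-level emission/deletion tracking and the additional statistics more concrete, the following is a minimal Python sketch. It is not simulstream's API: the function names and the toy data are hypothetical, and it assumes that NE denotes normalized erasure as commonly defined in the re-translation literature (tokens deleted across updates, divided by the length of the final output) and that RTF denotes the real-time factor (processing time divided by audio duration).

```python
from typing import List, Tuple


def diff_hypotheses(prev: List[str], curr: List[str]) -> Tuple[List[str], List[str]]:
    """Return (deleted, emitted) tokens needed to turn `prev` into `curr`.

    Hypothetical helper: everything after the longest common prefix of
    `prev` is deleted, and everything after it in `curr` is emitted.
    """
    lcp = 0
    for old, new in zip(prev, curr):
        if old != new:
            break
        lcp += 1
    return prev[lcp:], curr[lcp:]


def normalized_erasure(hypotheses: List[List[str]]) -> float:
    """Total number of deleted tokens divided by the final output length."""
    if not hypotheses or not hypotheses[-1]:
        return 0.0
    erased = 0
    prev: List[str] = []
    for curr in hypotheses:
        deleted, _ = diff_hypotheses(prev, curr)
        erased += len(deleted)
        prev = curr
    return erased / len(hypotheses[-1])


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time relative to audio duration (below 1 is faster than real time)."""
    return processing_seconds / audio_seconds


# Toy sequence of partial outputs from a system that revises one token.
updates = [
    ["I"],
    ["I", "think"],
    ["I", "believe", "that"],
    ["I", "believe", "that", "works"],
]

print(normalized_erasure(updates))   # one deleted token over four final tokens
print(real_time_factor(5.0, 10.0))   # 5 s of compute for 10 s of audio
```

Under these assumptions, an incremental decoding system that only appends tokens has an NE of 0, while a re-translation system accumulates erasure every time it rewrites part of its earlier output, which is what the user perceives as flicker.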

Abstract

Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it was designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. It also offers an interactive web interface to demo any system built within the tool.

Paper Structure

This paper contains 15 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Architecture of the simulstream tool.
  • Figure 2: Screenshot of the web interface.
  • Figure 3: Latency (StreamLAAL$\downarrow$) - Quality (COMET$\uparrow$) curves of sliding-window re-translation and StreamAtt incremental methods on SeamlessM4T v1 medium. Dashed lines indicate computationally aware latency, while solid lines indicate computationally unaware latency.