Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Shanbo Cheng; Zhichao Huang; Tom Ko; Hang Li; Ningxin Peng; Lu Xu; Qini Zhang

TL;DR

Inspired by professional human interpreters, CLASI uses a novel data-driven read-write strategy to balance translation quality against latency, and a multi-modal retrieving module to obtain relevant information that augments the translation.

Abstract

In this paper, we present Cross Language Agent -- Simultaneous Interpretation, CLASI, a high-quality and human-like Simultaneous Speech Translation (SiST) system. Inspired by professional human interpreters, we utilize a novel data-driven read-write strategy to balance translation quality and latency. To address the challenge of translating in-domain terminology, CLASI employs a multi-modal retrieving module to obtain relevant information to augment the translation. Supported by LLMs, our approach can generate error-tolerant translations by considering the input audio, historical context, and retrieved information. Experimental results show that our system outperforms other systems by significant margins. Aligned with professional human interpreters, we evaluate CLASI with a better human evaluation metric, valid information proportion (VIP), which measures the amount of information that can be successfully conveyed to the listeners. In real-world scenarios, where speech is often disfluent, informal, and unclear, CLASI achieves a VIP of 81.3% and 78.0% for the Chinese-to-English and English-to-Chinese translation directions, respectively. In contrast, state-of-the-art commercial or open-source systems only achieve 35.4% and 41.6%. On the extremely hard dataset, where other systems achieve under 13% VIP, CLASI can still achieve 70% VIP.

Paper Structure (34 sections, 2 equations, 5 figures, 11 tables)

Introduction
Methods
Framework
Architecture
Data Driven Read-Write Policy: <INPUT> and <OUTPUT>
Context Information: <LOAD_MEM> and <UPDATE_MEM>
Multi-Modal Retrieval Augmented Generation: <RETRIEVE>
Multi-Stage Training
Pretraining
Multi-task Continual Training
Multi-task Supervised Fine-tuning
Multi-Modal Retriever Training
Experiments
Evaluation Benchmark
Baselines
...and 19 more sections

Figures (5)

Figure 1: Performance evaluation. CLASI significantly outperforms the leading commercial and open-source systems using a more reliable VIP metric, achieving human interpreter parity.
Figure 2: Overall framework of CLASI. The process begins in Step 1, where CLASI processes the incoming audio data. Optionally, in Step 2, the retriever is activated to obtain relevant information from the external knowledge database, for instance the mapping from "伊辛模型" to "Ising model", so that the term is translated accurately. Step 3 involves accessing the transcription (optional) and translation stored in memory from the previous round. Steps 4 and 5 use the Chain-of-Thought (CoT) method to generate both the transcription (optional) and the translation, followed by a memory update. The cycle then repeats from Step 1 for the subsequent speech segment.
Figure 3: Architecture of CLASI agent. At round $r$, our model processes the current input audio stream alongside the memory from the previous round ($r-1$), and any retrieved knowledge. CLASI generates a response based on specified instructions and concurrently updates its memory. Additionally, the model determines the cut-off timestamp of the last semantic chunk. For instance, in the provided example, the phrase preceding "就在" is identified as a complete semantic chunk, with the cut-off timestamp positioned right after this phrase.
Figure 4: Analysis of VIP vs. different automatic metrics on the zh-en direction. The left panel shows the distribution and regression curve of the data points for each metric. The right panel shows line charts of the correlation between VIP and each automatic metric within multiple intervals. Due to limited human labeling capacity, we collect 35 rounds of human evaluation results for the zh-en direction on our in-house test set.
Figure 5: The first column indicates the golden transcription of the source text. Each row indicates one semantic fragment split by human evaluators. The second column is the translation results of CLASI. The third and fourth columns indicate the validity of translation and reference translation, respectively. In this case, the VIP is 24/29 ≈ 82.8%.
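
The arithmetic in the Figure 5 caption can be reproduced with a minimal sketch. It assumes that VIP is simply the fraction of human-split semantic fragments whose translation was judged valid, which is what the 24/29 figure suggests; the paper's full evaluation protocol may involve more than this. The function name `valid_information_proportion` and the label list are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def valid_information_proportion(fragment_is_valid: Sequence[bool]) -> float:
    """Return the share of semantic fragments judged valid (assumed VIP definition)."""
    if not fragment_is_valid:
        raise ValueError("need at least one semantic fragment")
    # True counts as 1 and False as 0, so sum() gives the number of valid fragments.
    return sum(fragment_is_valid) / len(fragment_is_valid)


# Hypothetical labels that match the counts in the caption:
# 29 fragments in total, 24 of them judged valid.
labels = [True] * 24 + [False] * 5
vip = valid_information_proportion(labels)
print(f"VIP = {vip:.1%}")
```

With these counts the script prints a VIP of 82.8%, matching the caption. Per-fragment boolean labels keep the computation tied to the evaluators' row-by-row judgments rather than to a single aggregate score.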
