DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

Xinglin Lyu, Wei Tang, Yuang Li, Xiaofeng Zhao, Ming Zhu, Junhui Li, Yunfei Lu, Min Zhang, Daimeng Wei, Hao Yang, Min Zhang

TL;DR

DoCIA addresses the challenge of leveraging document-level context in speech-to-text translation by introducing an online, four-stage cascade that adds document cues at ASR refinement, MT, and MT refinement using LLMs. It introduces a multi-level memory strategy with short- and long-memory components and a refinement-determination mechanism with threshold $\lambda$ and similarity function $g(O,I)$ to prevent hallucinations. Experiments on MuST-C across five directions with four LLMs show substantial gains over strong baselines, with larger improvements when a stronger base model is used and context is leveraged across multiple stages. The findings indicate document-level discourse context can mitigate ASR errors and enhance cross-sentence coherence in ST, suggesting a practical, scalable path for improving real-world ST systems.
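The refinement-determination idea can be illustrated with a minimal sketch. It assumes a refined output $O$ is accepted only when its similarity to the input $I$, $g(O,I)$, reaches the threshold $\lambda$, and that the input is kept otherwise. The summary above does not define $g$, the direction of the comparison, or a value for $\lambda$, so the character-level ratio from Python's `difflib`, the default of 0.6, and the function names `similarity` and `determine` are all illustrative assumptions rather than the authors' implementation.

```python
from difflib import SequenceMatcher


def similarity(output: str, source: str) -> float:
    """Stand-in for g(O, I): a character-level similarity in [0, 1].

    Assumption: the paper's actual similarity function is not given here,
    so difflib's ratio is used purely for illustration.
    """
    return SequenceMatcher(None, output, source).ratio()


def determine(source: str, refined: str, lam: float = 0.6) -> str:
    """Keep the refined text only if it stays close enough to the input.

    Assumption: a refinement whose similarity falls below the threshold
    `lam` (lambda) is treated as a possible hallucination and discarded.
    The default value 0.6 is hypothetical.
    """
    if similarity(refined, source) >= lam:
        return refined
    return source


if __name__ == "__main__":
    asr = "the cat sat on the mat"
    # A light edit stays close to the input, so it is accepted.
    print(determine(asr, "The cat sat on the mat."))
    # A drastic rewrite drifts far from the input, so the input is kept.
    print(determine(asr, "Completely unrelated hallucinated output about quantum physics"))
```

Under these assumptions, a higher `lam` makes the gate more conservative (fewer refinements accepted), while a lower value lets more aggressive rewrites through.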

Abstract

Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.

Paper Structure

This paper contains 42 sections, 9 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The traditional cascade-based ST system (top) and our proposed DoCIA for ST (bottom). In contrast to the traditional cascade, DoCIA introduces two refinement stages and is LLM-based and context-aware when translating the $i$-th audio segment in a speech.
  • Figure 2: The overall illustration of DoCIA when translating the $i$-th audio segment in a speech. The blue, purple and red lines denote the context retrieval, refinement determination and context updating processes, respectively.
  • Figure 3: Performance comparison for different context window sizes $L$.
  • Figure 4: Performance comparison for different combinations of $m$ and $n$.
  • Figure 5: LLM-based evaluation results for four different base models across five ST directions. For each translation, the LLM assigns a score ranging from 1 to 10 based on the provided document-level context. Tie to ASR-SMT and Tie to ASR-DMT indicate cases where the score for DoCIA’s translation is equal to or lower than the highest score achieved by ASR-SMT and ASR-DMT, respectively.
  • ...and 4 more figures