Decoder-only Streaming Transformer for Simultaneous Translation

Shoutao Guo; Shaolei Zhang; Yang Feng

TL;DR

This work tackles real-time translation by exploring a Decoder-only architecture for Simultaneous Machine Translation (SiMT), addressing training and inference hurdles caused by re-encoding and prefix propagation. It introduces the Decoder-only Streaming Transformer (DST) with separate positional encodings for source and target prefixes and a Streaming Self-Attention (SSA) mechanism to learn translation policies from partial inputs. The training objective adds summation, latency, and consistency constraints plus a curriculum strategy to align training with prefix-based inference, while inference uses a threshold on accumulated attention to decide when to generate. Experimental results on three standard SiMT benchmarks demonstrate state-of-the-art performance, favorable latency–quality tradeoffs, and improved efficiency, highlighting the viability of decoder-only architectures for real-time translation.
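The threshold rule mentioned above can be pictured with a toy sketch. It is not the authors' implementation: the function name `read_until_sufficient`, the list of per-token attention masses, and the exact stopping rule are all assumptions made for illustration. It only shows the general idea of reading source tokens until the accumulated attention reaches a threshold, then switching to generation.

```python
def read_until_sufficient(attention_stream, threshold):
    """Return how many source tokens are read before writing is triggered.

    attention_stream: hypothetical per-source-token attention mass, in reading order.
    threshold: accumulated mass at which reading stops and a target token is written.
    """
    accumulated = 0.0
    for num_read, mass in enumerate(attention_stream, start=1):
        accumulated += mass
        if accumulated >= threshold:
            # Enough source information has been seen: stop reading and write.
            return num_read
    # Source exhausted before reaching the threshold: write with everything read.
    return len(attention_stream)


if __name__ == "__main__":
    stream = [0.25, 0.25, 0.5]  # made-up values, not model output
    for threshold in (0.25, 0.5, 0.75):
        print(threshold, read_until_sufficient(stream, threshold))
```

In this toy setting a lower threshold triggers generation after fewer source tokens, which is the sense in which a single threshold trades latency against available context.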

Abstract

Simultaneous Machine Translation (SiMT) generates translation while reading source tokens, essentially producing the target prefix based on the source prefix. To achieve good performance, it leverages the relationship between source and target prefixes to derive a policy that guides the generation of translations. Although existing SiMT methods primarily focus on the Encoder-Decoder architecture, we explore the potential of the Decoder-only architecture, owing to its superior performance in various tasks and its inherent compatibility with SiMT. However, directly applying the Decoder-only architecture to SiMT poses challenges in terms of training and inference. To alleviate these problems, we propose the first Decoder-only SiMT model, named Decoder-only Streaming Transformer (DST). Specifically, DST separately encodes the positions of the source and target prefixes, ensuring that the position of the target prefix remains unaffected by the expansion of the source prefix. Furthermore, we propose a Streaming Self-Attention (SSA) mechanism tailored for the Decoder-only architecture. It obtains a translation policy by assessing the sufficiency of the input source information and integrates with the soft-attention mechanism to generate translations. Experiments demonstrate that our approach achieves state-of-the-art performance on three translation tasks.
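The point about separate position encoding can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's code: the helper names `concatenated_positions` and `separate_positions` are hypothetical, and plain integer indices stand in for whatever positional encoding the model actually uses. It contrasts a single position counter running over the concatenated source and target prefixes with two independent counters.

```python
def concatenated_positions(num_source, num_target):
    """One position counter over [source prefix ; target prefix]."""
    source = list(range(num_source))
    target = list(range(num_source, num_source + num_target))
    return source, target


def separate_positions(num_source, num_target):
    """Independent position counters for the source and target prefixes."""
    return list(range(num_source)), list(range(num_target))


if __name__ == "__main__":
    # Two target tokens have been written; the source prefix grows from 2 to 3 tokens.
    for num_source in (2, 3):
        _, shared = concatenated_positions(num_source, 2)
        _, independent = separate_positions(num_source, 2)
        print(num_source, shared, independent)
```

With a shared counter, every newly read source token shifts the positions of the target tokens already generated. With independent counters the target positions stay fixed, which is the property the abstract attributes to DST.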

Paper Structure (27 sections, 19 equations, 5 figures, 10 tables)

Introduction
Background
Simultaneous Machine Translation
Masked Self-Attention
Method
Model Architecture
Streaming Self-Attention
Inference
Training Method
Summation Constraint
Latency Constraint
Consistency Constraint
Curriculum Learning Strategy
Experiments
Datasets
...and 12 more sections

Figures (5)

Figure 1: Comparison of Encoder-Decoder architecture and Decoder-only architecture.
Figure 2: The architecture of DST. It shows the moment when DST generates $y_2$ after reading two source tokens.
Figure 3: Comparison of our approach with other SiMT methods on En$\rightarrow$Vi, En$\rightarrow$Ro and De$\rightarrow$En tasks.
Figure 4: The comparison of hallucination in translations generated by different SiMT models. The results are based on the De$\rightarrow$En dataset.
Figure 5: The illustration of cost matrix and attention allocation matrix. In this diagram, $I$ and $J$ are both set to $5$, and $\epsilon$ is set to $1$.
