Streaming Translation and Transcription Through Speech-to-Text Causal Alignment

Roman Koshkin; Jeon Haesung; Lianbo Liu; Hao Shi; Mengjie Zhao; Yusuke Fujita; Yui Sudo

Abstract

Simultaneous machine translation (SiMT) has traditionally relied on offline machine translation models coupled with human-engineered heuristics or learned policies. We propose Hikari, a policy-free, fully end-to-end model that performs simultaneous speech-to-text translation and streaming transcription by encoding READ/WRITE decisions into a probabilistic WAIT token mechanism. We also introduce Decoder Time Dilation, a mechanism that reduces autoregressive overhead and ensures a balanced training distribution. Additionally, we present a supervised fine-tuning strategy that trains the model to recover from delays, significantly improving the quality-latency trade-off. Evaluated on translation from English into Japanese, German, and Russian, Hikari achieves new state-of-the-art BLEU scores in both low- and high-latency regimes, outperforming recent baselines.
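To make the WAIT token idea concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that the decoder emits a special WAIT token whenever it wants more audio (READ) and any other token when it is ready to output text (WRITE). The names `WAIT`, `stream_decode`, and `toy_next_token`, as well as the toy emission schedule, are hypothetical and serve only as illustration.

```python
WAIT = "<wait>"  # hypothetical name for the special WAIT token


def stream_decode(audio_chunks, next_token):
    """Interleave READ and WRITE actions driven by a WAIT token.

    `next_token(audio_so_far, emitted)` is a hypothetical stand-in for the
    model: it returns WAIT (read more audio) or a target token (write it).
    """
    audio_so_far, emitted, timeline = [], [], []
    for chunk in audio_chunks:
        audio_so_far.append(chunk)  # READ: consume one more audio chunk
        while True:
            token = next_token(audio_so_far, emitted)
            if token == WAIT:
                break  # the model asks for more context
            emitted.append(token)  # WRITE: emit a target token
            timeline.append((len(audio_so_far), token))
    return emitted, timeline


def toy_next_token(audio_so_far, emitted):
    """Toy model (assumption): each token needs a minimum number of chunks."""
    schedule = [(2, "I"), (2, "met"), (4, "her")]  # (chunks needed, token)
    i = len(emitted)
    if i < len(schedule) and len(audio_so_far) >= schedule[i][0]:
        return schedule[i][1]
    return WAIT


emitted, timeline = stream_decode(["c1", "c2", "c3", "c4"], toy_next_token)
print(emitted)   # tokens in emission order
print(timeline)  # (number of chunks heard, token) pairs
```

In this toy run the third token is held back until the fourth chunk arrives, which mirrors the "wait for context, then write" behaviour the abstract describes. No separate policy module decides when to read or write; the decision is carried by the token stream itself.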

Paper Structure (31 sections, 2 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Cascaded Systems and Heuristic Policies
Adaptive and Learned Policies
Foundation Models and SiMT
Policy-Free Paradigm
Method
Architecture
Data
Preparation
Causal alignment
Sources
Training
Pre-training
Supervised fine-tuning
...and 16 more sections

Figures (6)

Figure 1: The challenge and reality of simultaneous translation. (Top) Conceptual illustration of syntactic divergence between English (SVO) and Japanese (SOV); the translator must "hold" the main verb "met" until the end of the Japanese sentence. (Bottom) Actual inference timeline, showing how Hikari-medium has learned the strategic patience required for syntactically divergent languages. Horizontal segments represent the model waiting for context (READING, i.e. emitting WAIT tokens, which are omitted from the plot for clarity), while the steep "steps" are rapid token emission (WRITING) once semantic ambiguity is resolved.
Figure 2: High-level overview of the model architecture.
Figure 3: Choosing the optimal value of decoder time dilation ($D$). Left panel: percentage of WAIT tokens in a 30 s en-ja sample as a function of $D \in \{1,4,6,10\}$. Lower $D$ improves the precision of causal alignment, but also (undesirably) increases the dominance of WAIT tokens in the training data. Right panel: with too high a value of $D$ (e.g. 6 or 10), target tokens corresponding to 30 s of input audio might not fully fit into the decoder's context (shown by dashed vertical lines for $D \in \{1,4,6,10\}$).
Figure 4: Real-time factor (RTF) as a function of decoder time dilation ($D$) on a single A100 or H100 GPU. The plot compares Hikari-medium across varying batch sizes ($B \in \{1, 10, 20\}$). The dashed blue line indicates the real-time threshold (RTF $= 1.0$); values below this line indicate generation speed faster than the input audio stream.
Figure 5: Construction of an SFT sample. The shaded transparent box indicates the chosen tokens to be delayed. The red rectangles illustrate the tokens for which cross-entropy is calculated. Orange and green squares are non-wait and WAIT tokens, respectively.
...and 1 more figure
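The real-time factor in the Figure 4 caption can be illustrated with a short sketch. It uses the standard definition of RTF as processing time divided by audio duration, which matches the caption's statement that values below 1.0 mean generation is faster than the incoming audio. The function names and the example timings are hypothetical and are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_real_time(rtf: float, threshold: float = 1.0) -> bool:
    """True when generation keeps up with the input audio stream."""
    return rtf < threshold


# Hypothetical timings: 12 s of compute for a 30 s audio segment.
rtf = real_time_factor(processing_seconds=12.0, audio_seconds=30.0)
print(f"RTF = {rtf:.2f}, real-time: {is_real_time(rtf)}")
```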
