Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation

Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Yang Feng

TL;DR

This work tackles simultaneous generation by turning large language models (LLMs) into policy-makers that decide when to write outputs while reading streaming inputs. It introduces LLM-driven Simultaneous Generation (LSG), which uses the wait-1 policy as a fixed baseline and triggers writing with a KL-divergence criterion, $D_{KL}\bigl[ p(y_i\mid\mathbf{x}_{\le j},\mathbf{y}_{<i})\,\|\,p(y_i\mid\mathbf{x}_{\le i},\mathbf{y}_{<i}) \bigr] > \delta$, where $\mathbf{x}_{\le j}$ is the source prefix read so far and the second distribution is the one available to the wait-1 baseline. The criterion is augmented by a confidence condition, $\max p(y_i\mid\mathbf{x}_{\le j},\mathbf{y}_{<i}) > \alpha$, and a range constraint on the candidate output span. The approach yields state-of-the-art results on simultaneous text-to-text translation (SimulT2TT), simultaneous speech-to-text translation (SimulS2TT), and streaming ASR across multiple datasets, using open-source LLMs (e.g., Llama2-7B-chat with LoRA and Qwen-Audio) to perform policy-making and generation jointly, without explicit policy learning. The results show favorable latency–quality trade-offs in streaming settings and indicate that higher-capacity LLMs improve results further, highlighting the potential of LLMs to act as autonomous policy-makers in real-time generation tasks.
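As a rough illustration of the two threshold tests named above, the following minimal Python sketch computes the KL divergence between a next-token distribution conditioned on the current source prefix and the one the wait-1 baseline would see, then checks it against $\delta$ and checks the maximum probability against $\alpha$. The function names, the plain-list representation of distributions, and the choice to return the two signals separately are assumptions made for illustration. This summary does not say how LSG combines the signals, and the range constraint is not modeled here.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps is a small smoothing constant (an assumption) to avoid log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def write_signals(p_current, p_baseline, delta, alpha):
    """Return (kl_exceeds_delta, confidence_exceeds_alpha) for one target step.

    p_current:  p(y_i | x_<=j, y_<i), given the source prefix read so far
    p_baseline: p(y_i | x_<=i, y_<i), given what the wait-1 baseline has read
    How the two signals are combined into a READ/WRITE decision is left open.
    """
    kl = kl_divergence(p_current, p_baseline)
    confidence = max(p_current)
    return kl > delta, confidence > alpha


# Hypothetical toy distributions over a 3-token vocabulary.
kl_flag, conf_flag = write_signals(
    p_current=[0.9, 0.05, 0.05],
    p_baseline=[0.4, 0.3, 0.3],
    delta=0.5,
    alpha=0.6,
)
```

In practice the two distributions would come from two forward passes of the same LLM with different source prefixes. The toy lists here only stand in for those softmax outputs.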

Abstract

Simultaneous generation models write generation results while reading streaming inputs, which requires a policy-maker to determine the appropriate output timing. Existing simultaneous generation methods generally adopt the traditional encoder-decoder architecture and learn the generation and policy-making capabilities through complex dynamic programming techniques. Although LLMs excel at text generation, they struggle to take on the role of policy-maker through traditional training methods, which has limited their exploration in simultaneous generation. To overcome these limitations, we propose a novel LLM-driven Simultaneous Generation (LSG) framework, which allows an off-the-shelf LLM to decide the generation timing and produce output concurrently. Specifically, LSG selects the generation policy that minimizes latency as the baseline policy. Referring to the baseline policy, LSG enables the LLM to devise an improved generation policy that better balances latency and generation quality, and writes generation results accordingly. Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method achieves state-of-the-art performance using open-source LLMs and demonstrates practicality in real-world scenarios.
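To make the notion of a minimum-latency baseline policy concrete, here is a small sketch of the standard wait-k schedule from the simultaneous translation literature: read k source tokens, then alternate writing and reading until the source is exhausted. With k = 1 the i-th output is written after i source tokens have been read, which matches the conditioning $\mathbf{x}_{\le i}$ in the TL;DR. The function name and the token-count interface are hypothetical, and this is not taken from the authors' code.

```python
def wait_k_actions(k, source_len, target_len):
    """List the READ/WRITE actions of a wait-k policy.

    Before writing the i-th target token (1-indexed), wait-k wants
    k + i - 1 source tokens read, capped at the source length.
    """
    actions = []
    read = written = 0
    while written < target_len:
        needed = min(k + written, source_len)
        if read < needed:
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


# wait-1 on a 3-token source and 3-token target alternates strictly.
schedule = wait_k_actions(k=1, source_len=3, target_len=3)
```

An improved policy in the sense described above would insert extra READ actions before some WRITE actions, trading latency for quality.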

Paper Structure (26 sections, 6 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: The distribution difference of subsequent generation states compared to the wait-1 policy for a German$\Rightarrow$English translation example. The distribution difference is measured by KL divergence.
  • Figure 2: The framework of the LLM-driven Simultaneous Generation (LSG) model.
  • Figure 3: Performance of simultaneous generation models on the De$\Rightarrow$En, En$\Rightarrow$De and Fr$\Rightarrow$En datasets. We also evaluate the Computation-Aware (CA) latency on the CoVoST2 Fr$\Rightarrow$En dataset to assess the usability of systems in real-world scenarios.
  • Figure 4: The performance of the LSG framework when employing various LLMs. The results are reported on the WMT22 Chinese$\Rightarrow$English dataset.
  • Figure 5: Comparison of the policy sufficiency of different simultaneous generation policies. The experiments are based on the De$\Rightarrow$En dataset.
  • ...and 5 more figures