LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen

TL;DR

This work tackles the challenge of applying batch-trained LLMs to streaming scenarios by identifying three key mismatches between batch and streaming processing: input-attention, output-attention, and position-ID. Through stepwise ablations on streaming translation tasks, the authors demonstrate that input-attention mismatch is the primary driver of performance loss, while output-attention and position-ID mismatches contribute negligibly. They further analyze the role of position encoding, showing that preserving relative order within source and target contexts matters more than maintaining absolute order, and propose a group position encoding paradigm that aligns streaming with batch processing without re-encoding. The group-streaming approach is model-agnostic, scalable, and generalizes across cross-lingual and cross-modal tasks (e.g., translation and ASR), outperforming existing streaming-specific methods while enabling seamless offline and online operation. The work includes extensive experiments, visual analyses of streaming attention, and open-source code, highlighting practical impact for real-time AI systems leveraging LLMs.

Abstract

Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by this analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
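
To make the group position encoding idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that source tokens are numbered upward from 0 and target tokens upward from a group start ID `phi` (the hyperparameter named in the Figure 2 caption). The function name `group_position_ids` and the `"S"`/`"T"` event labels are hypothetical and chosen only for illustration.

```python
from typing import List


def group_position_ids(events: List[str], phi: int) -> List[int]:
    """Assign position IDs to an interleaved stream of tokens.

    `events` lists tokens in arrival order: "S" for a source token,
    "T" for a generated target token. Assumption for this sketch:
    the source group counts up from 0 and the target group counts
    up from `phi`, so each group keeps its own relative order.
    """
    next_src, next_tgt = 0, phi
    ids: List[int] = []
    for kind in events:
        if kind == "S":
            ids.append(next_src)
            next_src += 1
        elif kind == "T":
            ids.append(next_tgt)
            next_tgt += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return ids


# Two source tokens arrive, one target token is emitted, and so on.
stream = ["S", "S", "T", "S", "T", "T"]
print(group_position_ids(stream, phi=100))
```

In this sketch an ID is fixed at the moment its token arrives and never changes when later tokens are appended, which is why no position re-encoding is needed. Relative order is preserved inside each group even though the absolute arrival order is not reflected in the IDs.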

Paper Structure

This paper contains 46 sections, 10 equations, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Two streaming paradigms of LLMs: (a) Batch-streaming simulates batch-processing, while interleaved-streaming encodes streaming data in arrival order. (a-1) Input-Attention Mismatch: Whether the source tokens can attend to the target tokens. (a-2) Output-Attention Mismatch: Whether the target tokens can attend to the new source token. (a-3) Position-ID Mismatch: Whether the position IDs reflect the actual token order. (b) Batch-streaming relies on (b-1) KV cache re-encoding and (b-2) position re-encoding to simulate batch-processing.
  • Figure 1: An ASR example for illustration of different paradigms for LLMs processing.
  • Figure 2: Framework of our Group-streaming LLMs. (Left) Positional grouping of source and target tokens in the streaming LLM, avoiding re-encoding. The group start ID $\phi$ is a hyperparameter. (Right) The attention mask matrix during the training ensures that target tokens can only attend to locally available inputs.
  • Figure 2: Illustration of our Group-streaming speech Large Language Model, where the group-streaming LLM and the streaming audio encoder are connected through an MLP projector.
  • Figure 3: An example of the attention distribution of target tokens, where the attention values of each target token are normalized to emphasize the relative focus. The sample is from IWSLT-17 En-Fr dataset.
  • ...and 8 more figures
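
The training-time attention mask described in the Figure 2 caption (target tokens attend only to locally available inputs) can be sketched as follows. This is an illustrative construction under stated assumptions, not the paper's code: the sequence is assumed to be laid out as all source tokens followed by all target tokens, and `visible_src[t]` is a hypothetical input giving how many source tokens had arrived when target token `t` was produced. The function name `streaming_train_mask` is likewise hypothetical.

```python
from typing import List


def streaming_train_mask(n_src: int, visible_src: List[int]) -> List[List[bool]]:
    """Build a boolean attention mask; mask[i][j] is True if token i may attend to token j.

    Assumed layout: indices 0..n_src-1 are source tokens, the rest are targets.
    Source tokens attend causally to earlier source tokens only.
    Target token t attends to the first visible_src[t] source tokens
    and causally to target tokens up to and including itself.
    """
    n = n_src + len(visible_src)
    mask = [[False] * n for _ in range(n)]
    # Source rows: plain causal attention over the source group.
    for i in range(n_src):
        for j in range(i + 1):
            mask[i][j] = True
    # Target rows: restricted source view plus causal target view.
    for t, k in enumerate(visible_src):
        if not 0 <= k <= n_src:
            raise ValueError("visible_src entries must lie in [0, n_src]")
        row = n_src + t
        for j in range(k):
            mask[row][j] = True
        for j in range(n_src, row + 1):
            mask[row][j] = True
    return mask


# Three source tokens; the first target saw 1 of them, the second saw all 3.
for row in streaming_train_mask(3, [1, 3]):
    print("".join("x" if allowed else "." for allowed in row))
```

Because source rows never include target columns, this layout also reflects the input-attention constraint from Figure 1: source tokens do not attend to target tokens.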