LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Junlong Tong; Jinlan Fu; Zixuan Lin; Yingqi Fan; Anhao Zhao; Hui Su; Xiaoyu Shen

TL;DR

This work tackles the challenge of applying batch-trained LLMs to streaming scenarios by identifying three key mismatches between batch and streaming processing: input-attention, output-attention, and position-ID. Through stepwise ablations on streaming translation tasks, the authors demonstrate that input-attention mismatch is the primary driver of performance loss, while output-attention and position-ID mismatches contribute negligibly. They further analyze the role of position encoding, showing that preserving relative order within source and target contexts matters more than maintaining absolute order, and propose a group position encoding paradigm that aligns streaming with batch processing without re-encoding. The group-streaming approach is model-agnostic, scalable, and generalizes across cross-lingual and cross-modal tasks (e.g., translation and ASR), outperforming existing streaming-specific methods while enabling seamless offline and online operation. The work includes extensive experiments, visual analyses of streaming attention, and open-source code, highlighting practical impact for real-time AI systems leveraging LLMs.

Abstract

Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by this analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
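
To make the position-ID mismatch concrete, the following is a minimal illustrative sketch, not the released implementation. It assumes a stream of interleaved source and target chunks and compares three ways of numbering tokens: by arrival order, by the batch layout (all source tokens first, then all target tokens), and by group, where source and target each keep their own running counter. The function names, the event format, and the `tgt_offset` parameter are hypothetical and introduced only for this example; the exact offsets used in the paper may differ.

```python
from typing import Dict, List, Tuple

# One event is (group, number_of_tokens); group is "src" or "tgt".
# This event format is an assumption made for illustration only.
Event = Tuple[str, int]
Ids = List[Tuple[str, List[int]]]


def arrival_order_ids(events: List[Event]) -> Ids:
    """Number tokens in the order they arrive (source and target interleaved)."""
    out, next_id = [], 0
    for group, n in events:
        out.append((group, list(range(next_id, next_id + n))))
        next_id += n
    return out


def batch_ids(events: List[Event]) -> Ids:
    """Number tokens as a batch model would: all source first, then all target.

    Target IDs depend on the total source length, which is unknown mid-stream.
    """
    n_src = sum(n for group, n in events if group == "src")
    counters: Dict[str, int] = {"src": 0, "tgt": n_src}
    out = []
    for group, n in events:
        start = counters[group]
        out.append((group, list(range(start, start + n))))
        counters[group] = start + n
    return out


def group_ids(events: List[Event], tgt_offset: int = 0) -> Ids:
    """Number tokens per group: each group has its own running counter.

    Relative order inside each group is preserved; absolute order across
    groups is not. `tgt_offset` is a hypothetical fixed start for the target.
    """
    counters: Dict[str, int] = {"src": 0, "tgt": tgt_offset}
    out = []
    for group, n in events:
        start = counters[group]
        out.append((group, list(range(start, start + n))))
        counters[group] = start + n
    return out


events = [("src", 2), ("tgt", 1), ("src", 2), ("tgt", 1)]
print(arrival_order_ids(events))
print(batch_ids(events))
print(group_ids(events))
```

Under these assumptions, the group IDs assigned to an early prefix of the stream never change as more source tokens arrive, whereas the batch-layout IDs of already generated target tokens shift whenever the source grows. That shift is what would force re-encoding if absolute batch positions had to be reproduced exactly.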
