
From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models

Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen

TL;DR

A unified definition of streaming LLMs based on data flow and dynamic interaction is established to clarify existing ambiguities, and a systematic taxonomy of current streaming LLMs is proposed.

Abstract

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

Paper Structure (44 sections, 1 equation, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Illustration of three types of streaming large language models (LLMs). (Left) Output-streaming LLM performs streaming generation after static reading. (Middle) Sequential-streaming LLM performs streaming generation after streaming reading. (Right) Concurrent-streaming LLM performs streaming generation while streaming reading.
  • Figure 2: Overview of streaming LLM paradigms and their key challenges. The figure contrasts Output-streaming, Sequential-streaming, and Concurrent-streaming LLMs, highlighting their core goals and corresponding research components. Concurrent-streaming builds on the first two and adds extra challenges in real-time streaming architecture adaptation and interaction policy learning.
  • Figure 3: Taxonomy of Streaming Large Language Models.
  • Figure 4: Illustration of structural conflicts when adapting batch-oriented LLMs (left) to concurrent streaming (right). The figure's legend marks the token generation direction, attention dependencies, input blocks, and output blocks. (1) Attention contention: the causal dependency between newly inserted streaming input and historical outputs is ambiguous. (2) Position-ID conflict: the new streaming input and the generated output compete for the same position ID.
  • Figure 5: Illustration of interaction decision in concurrent streaming LLMs, where the model learns to dynamically schedule reading inputs and emitting outputs.
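
The three paradigms in Figure 1 differ only in when reading and writing happen relative to each other, which a toy event trace can make concrete. The sketch below is illustrative and not taken from the paper: `toy_decode` is a hypothetical stand-in for a decoding step (it upper-cases the matching input token), the `R:`/`W:` trace format is invented for this example, and the fixed wait-k schedule in `concurrent_streaming` is a simple hand-written substitute for the learned interaction policy that Figure 5 describes.

```python
from typing import Iterable, Iterator, List


def toy_decode(source: List[str], step: int) -> str:
    # Hypothetical stand-in for one decoding step:
    # output token i is just input token i in upper case.
    return source[step].upper()


def output_streaming(prompt: List[str]) -> Iterator[str]:
    # Static reading: the whole prompt is available up front (one read event),
    # then outputs are emitted one token at a time.
    yield "R:" + " ".join(prompt)
    for i in range(len(prompt)):
        yield "W:" + toy_decode(prompt, i)


def sequential_streaming(stream: Iterable[str]) -> Iterator[str]:
    # Streaming reading: input tokens arrive one by one,
    # and generation starts only after the input stream ends.
    seen: List[str] = []
    for tok in stream:
        seen.append(tok)
        yield "R:" + tok
    for i in range(len(seen)):
        yield "W:" + toy_decode(seen, i)


def concurrent_streaming(stream: Iterable[str], k: int = 1) -> Iterator[str]:
    # Reading and writing interleave. Assumed policy (not the paper's):
    # write one token whenever the input is at least k tokens ahead of the output.
    seen: List[str] = []
    written = 0
    for tok in stream:
        seen.append(tok)
        yield "R:" + tok
        if len(seen) - written >= k:
            yield "W:" + toy_decode(seen, written)
            written += 1
    # Input exhausted: flush the remaining outputs.
    while written < len(seen):
        yield "W:" + toy_decode(seen, written)
        written += 1


if __name__ == "__main__":
    tokens = ["a", "b", "c"]
    print(list(output_streaming(tokens)))
    print(list(sequential_streaming(iter(tokens))))
    print(list(concurrent_streaming(iter(tokens), k=1)))
```

With `k=1` the concurrent trace alternates reads and writes, while the sequential trace finishes all reads before the first write. A learned policy would replace the fixed `len(seen) - written >= k` test with a model decision at each step.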