Staircase Streaming for Low-Latency Multi-Agent Inference

Junlin Wang, Jue Wang, Zhen Xu, Ben Athiwaratkun, Bhuwan Dhingra, Ce Zhang, James Zou

TL;DR

Staircase streaming tackles latency in multi-agent LLM inference by streaming tokens between proposers and the aggregator as soon as partial results are available, enabling a pipelined execution that shortens time-to-first-token (TTFT). The approach is formalized with TTFT_normal and TTFT_staircase and augmented by a prefix-caching variant to reduce prompt-token overhead. Empirical evaluation on Arena-Hard and AlpacaEval shows TTFT reductions up to 93% and up to 1.6x increases in tokens-per-second, with maintained or improved reasoning capability and scalability to larger models. This work enables practical low-latency, high-quality multi-agent inference for latency-sensitive tasks such as chat and real-time reasoning.

Abstract

Recent advances in large language models (LLMs) have opened up new directions for leveraging the collective expertise of multiple LLMs. These methods, such as Mixture-of-Agents, typically employ additional inference steps to generate intermediate outputs, which are then used to produce the final response. While multi-agent inference can enhance response quality, it can significantly increase the time to first token (TTFT), posing a challenge for latency-sensitive applications and hurting user experience. To address this issue, we propose staircase streaming for low-latency multi-agent inference. Instead of waiting for the complete intermediate outputs from previous steps, we begin generating the final response as soon as we receive partial outputs from these steps. Experimental results demonstrate that staircase streaming reduces TTFT by up to 93% while maintaining response quality.
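
To make the latency argument concrete, the following is a minimal toy model of the two schedules for a single proposer layer feeding one aggregator. It is a sketch under stated assumptions, not the paper's formalization: it assumes each proposer decodes at a constant rate, that the aggregator can start as soon as the slowest proposer has delivered what the schedule requires, and that the aggregator's own prefill time is a fixed constant. The names `Proposer`, `ttft_normal`, `ttft_staircase`, and `first_chunk_tokens` are hypothetical, and all numbers are illustrative rather than measured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Proposer:
    """Hypothetical latency profile of one proposer model."""
    ttft_s: float         # seconds until the proposer emits its first token
    tokens_per_s: float   # assumed constant decode rate
    response_tokens: int  # length of the proposer's full response


def time_to_emit(p: Proposer, n_tokens: int) -> float:
    """Seconds until proposer p has emitted its first n_tokens tokens."""
    return p.ttft_s + n_tokens / p.tokens_per_s


def ttft_normal(proposers: Sequence[Proposer], aggregator_ttft_s: float) -> float:
    """Normal schedule: the aggregator waits for every full proposer response."""
    slowest = max(time_to_emit(p, p.response_tokens) for p in proposers)
    return slowest + aggregator_ttft_s


def ttft_staircase(
    proposers: Sequence[Proposer],
    aggregator_ttft_s: float,
    first_chunk_tokens: int,
) -> float:
    """Staircase schedule: the aggregator starts once every proposer has
    delivered its first chunk (or its whole response, if that is shorter)."""
    slowest = max(
        time_to_emit(p, min(first_chunk_tokens, p.response_tokens))
        for p in proposers
    )
    return slowest + aggregator_ttft_s


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper's experiments.
    proposers = [
        Proposer(ttft_s=0.4, tokens_per_s=60.0, response_tokens=480),
        Proposer(ttft_s=0.5, tokens_per_s=50.0, response_tokens=500),
        Proposer(ttft_s=0.3, tokens_per_s=40.0, response_tokens=300),
    ]
    normal = ttft_normal(proposers, aggregator_ttft_s=0.5)
    staircase = ttft_staircase(proposers, aggregator_ttft_s=0.5, first_chunk_tokens=25)
    print(f"normal TTFT:    {normal:.2f} s")
    print(f"staircase TTFT: {staircase:.2f} s")
```

In this toy model the normal schedule's TTFT grows with the length of the longest proposer response, while the staircase schedule's TTFT depends only on the first-chunk size, which is why a smaller first chunk lowers TTFT. The model ignores the quality cost of aggregating from partial outputs and any extra prompt-token overhead from re-sending chunks.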

Paper Structure

This paper contains 33 sections, 1 equation, 3 figures, 9 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison of normal streaming and staircase streaming using an MoA with 3 proposers and 1 aggregator. (a) In normal streaming, each LLM generates a full response before proceeding to the next step, leading to a longer TTFT. (b) Staircase streaming reduces TTFT by initiating the next step once the first chunks of proposed responses are available, enabling parallel processing between the proposers and the aggregator.
  • Figure 2: Results on Larger LLMs. The 'Best Single' model is WizardLM 8x22B. For MoA, the proposers include Qwen1.5-72B-Chat, Qwen1.5-110B-Chat, WizardLM 8x22B, Mixtral-8x22B-Instruct-v0.1, and Llama-3-70B-Instruct. The aggregator is Qwen1.5-110B-Chat. TTFT was evaluated using the Together serverless endpoint, so the results may vary depending on server load.
  • Figure 3: The impact of first chunk size on Arena-Hard win rate and TTFT.