Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang

TL;DR

This work identifies a critical bottleneck in real-time LLM token streaming: under unstable networks, a single packet loss can stall the rendering of subsequent tokens. It introduces Eloquent, a loss-resilient transmission scheme that places unacknowledged tokens into packets alongside newly generated tokens, so each received packet can be rendered independently, reducing retransmission-induced stalls. Through end-to-end simulations and real-trace driven experiments, Eloquent achieves substantial stall reductions (up to 71.0% compared with TCP retransmission and 31.6% compared with naive duplication) while maintaining comparable data overhead. The results suggest that applying Eloquent within UDP-based transports and coupling it with QoE-aware strategies can markedly improve user experience for LLM chatbots in lossy networks, motivating practical deployment and further research into cross-layer design and client-server signaling.

Abstract

To render each generated token in real time for users, the Large Language Model (LLM) server generates tokens one by one and streams each token (or a group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience can suffer greatly from stalls, since one packet loss can block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, which differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, at the same time, can be rendered independently when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show that Eloquent reduces the stall ratio (the proportion of token rendering wait time) by 71.0% compared to the retransmission method commonly used by real chatbot applications and by 31.6% compared to the baseline packet duplication scheme. By tailoring Eloquent to fit the token-by-token generation of LLMs, we enable chatbots to respond like an eloquent speaker, so that users can better enjoy pervasive AI.
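
The packing rule described in the abstract (new tokens plus all currently unacknowledged tokens in every outgoing packet) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `Packet`, `Sender`, and `Receiver`, the use of per-token indices, and the cumulative acknowledgement returned by the receiver are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    # Token index -> token text. Hypothetical wire format for illustration.
    tokens: dict[int, str]


@dataclass
class Sender:
    unacked: dict[int, str] = field(default_factory=dict)
    next_index: int = 0

    def make_packet(self, new_tokens: list[str]) -> Packet:
        """Pack the new tokens together with every token not yet acknowledged."""
        for tok in new_tokens:
            self.unacked[self.next_index] = tok
            self.next_index += 1
        return Packet(tokens=dict(self.unacked))

    def on_ack(self, acked_through: int) -> None:
        """Assumed cumulative ACK: forget tokens with index <= acked_through."""
        self.unacked = {i: t for i, t in self.unacked.items() if i > acked_through}


@dataclass
class Receiver:
    buffer: dict[int, str] = field(default_factory=dict)
    rendered: list[str] = field(default_factory=list)

    def on_packet(self, pkt: Packet) -> int:
        """Render every token that is now contiguous; return the cumulative ACK."""
        for idx, tok in pkt.tokens.items():
            if idx >= len(self.rendered):  # skip tokens already rendered
                self.buffer[idx] = tok
        while len(self.rendered) in self.buffer:
            self.rendered.append(self.buffer.pop(len(self.rendered)))
        return len(self.rendered) - 1


if __name__ == "__main__":
    sender, receiver = Sender(), Receiver()
    p1 = sender.make_packet(["The"])      # assume this packet is lost
    p2 = sender.make_packet([" quick"])   # assume this packet is lost too
    p3 = sender.make_packet([" fox"])     # arrives, carrying all three tokens
    ack = receiver.on_packet(p3)
    sender.on_ack(ack)
    print("".join(receiver.rendered))
```

In this sketch the third packet alone is enough to render all three tokens, because it still carries the two tokens from the lost packets; with in-order retransmission the receiver would instead wait for the lost packets to be resent. The cost is that packets grow while acknowledgements are outstanding, which is the data overhead the evaluation compares against duplication.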

Paper Structure (20 sections, 1 equation, 7 figures, 2 algorithms)

Figures (7)

  • Figure 1: LLM Chatbot Token Streaming Pipeline. We aim to improve the transmission part (Blue Starred 5a, 7a)
  • Figure 2: An illustrative example. When the packets containing the first two tokens are lost, Eloquent significantly reduces stalls at a similar overall sending rate, compared to TCP or duplicating each packet twice. We use #1, #2, #3 to represent the newly generated tokens in later columns and only show ACKs that could signal retransmission.
  • Figure 3: Our Network Measurement Testbed
  • Figure 4: Packet arrival and token rendering of one measured session. Retransmitted packets that block rendering are circled in red; blocked packets are circled in blue.
  • Figure 5: Eloquent reduces stall ratio under a 15% loss rate. Duplication-X denotes duplication at rates 2x, 3x, 4x, and 5x, from left to right.
  • ...and 2 more figures