PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Shufan Li, Aditya Grover

TL;DR

PredGen introduces input-time speculation to reduce latency in real-time, LLM-driven voice chat while preserving the modular cascade architecture. By combining iterative verification and predictive generation (AR and Jacobi with CLLM) with preemptive TTS, PredGen shortens the time to the start of audio output, achieving around a 2x average latency reduction across diverse benchmarks with a controllable quality trade-off. The approach is demonstrated on-device with multiple base models and generalizes to several LLMs at the 7B+ scale, which suggests practical potential for privacy-conscious, single-user deployments. Overall, PredGen makes low-latency cascaded voice systems more feasible by using otherwise idle input-time compute and by offering configurable strategies for balancing latency against sample quality.

Abstract

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We find that this latency is dominated by the time it takes the LLM to generate the first sentence, which the TTS system requires as input because it synthesizes audio responses sentence by sentence. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
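The input-time loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy model, and the exact-match acceptance rule are all assumptions made for illustration, and a real verifier would check the whole draft in a single forward pass rather than token by token. Each time a longer partial prompt arrives, the sketch keeps the longest prefix of the previous draft response that the model would still produce, regenerates the remainder, and extracts the first sentence that a TTS system would synthesize ahead of time.

```python
from typing import Callable, List, Tuple

Token = str
# Hypothetical model interface: (prompt tokens, response prefix) -> next token.
NextToken = Callable[[List[Token], List[Token]], Token]
EOS = "<eos>"


def accept_prefix(draft: List[Token], prompt: List[Token], next_token: NextToken) -> int:
    """Count leading draft tokens that greedy decoding still yields for the new prompt."""
    k = 0
    for tok in draft:
        if next_token(prompt, draft[:k]) != tok:
            break
        k += 1
    return k


def continue_generation(prompt: List[Token], kept: List[Token],
                        next_token: NextToken, max_len: int = 32) -> List[Token]:
    """Extend the accepted prefix autoregressively until EOS or max_len."""
    out = list(kept)
    while len(out) < max_len:
        tok = next_token(prompt, out)
        if tok == EOS:
            break
        out.append(tok)
    return out


def first_sentence(tokens: List[Token]) -> List[Token]:
    """Return tokens up to and including the first sentence-ending token."""
    for i, tok in enumerate(tokens):
        if tok in {".", "?", "!"}:
            return tokens[: i + 1]
    return tokens


def input_time_speculation(partial_prompts: List[List[Token]],
                           next_token: NextToken) -> List[Tuple[int, List[Token]]]:
    """For each partial prompt, record (accepted token count, first sentence for TTS)."""
    response: List[Token] = []
    history = []
    for prompt in partial_prompts:
        k = accept_prefix(response, prompt, next_token)
        response = continue_generation(prompt, response[:k], next_token)
        history.append((k, first_sentence(response)))
    return history


def toy_next_token(prompt: List[Token], prefix: List[Token]) -> Token:
    """Toy deterministic 'model' (assumption): the reply depends on the last prompt word."""
    target = ["Sure", ",", prompt[-1], "is", "fun", ".", "Enjoy", "!"]
    return target[len(prefix)] if len(prefix) < len(target) else EOS


prompts = [["tell", "me", "about"], ["tell", "me", "about", "jazz"]]
history = input_time_speculation(prompts, toy_next_token)
```

In this toy run, the second partial prompt reuses the first two draft tokens and regenerates only the rest. If the final prompt leaves the first sentence unchanged, audio synthesized for it during input time could be played immediately, which is where the latency saving would come from.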

Paper Structure

This paper contains 38 sections, 1 equation, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: The overall pipeline of PredGen.
  • Figure 2: Algorithm design of PredGen: (a) Upon receiving a partial prompt $P_{i+1}$, we use a verifier to accept $k_i$ tokens from $R_i$. We then create an updated response $R_{i+1}$ and synthesize the audio of its first sentence $A_{i+1}$. (b) We illustrate the Jacobi decoding process and compare it with AR decoding. Jacobi decoding is more efficient in this particular example.
  • Figure 3: Additional Experiments: (a) We report the audio latency of four different LLMs at around the 8B scale on the Lmsys dataset. (b) We report the NFETFS on MT-Bench for each round of the conversation.
  • Figure 4: Effect of Top-K Acceptance.
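
The comparison between Jacobi and AR decoding mentioned in the Figure 2 caption can be made concrete with a toy sketch. It is a generic illustration of Jacobi (fixed-point) decoding under greedy sampling, not the paper's CLLM-based procedure; the toy model and all function names are assumptions. AR decoding spends one model pass per token, whereas Jacobi decoding refines a full guess of n tokens at every pass and stops once the guess no longer changes. The fixed point equals the greedy AR output, and a warm start, such as a previous draft response, can reduce the number of passes.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: (prompt tokens, response prefix) -> next token id.
NextToken = Callable[[List[int], List[int]], int]


def toy_next_token(prompt: List[int], prefix: List[int]) -> int:
    """Toy deterministic 'model' (assumption): next token depends on the previous one."""
    prev = prefix[-1] if prefix else sum(prompt)
    return (3 * prev + 1) % 7


def ar_decode(prompt: List[int], next_token: NextToken, n: int) -> Tuple[List[int], int]:
    """Greedy autoregressive decoding: one model pass per generated token."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(prompt, out))
    return out, n


def jacobi_decode(prompt: List[int], next_token: NextToken,
                  init: List[int]) -> Tuple[List[int], int]:
    """Jacobi decoding: update every position from the current guess until a fixed point."""
    guess = list(init)
    passes = 0
    while True:
        # In a real model, this list is produced by a single parallel forward pass.
        new = [next_token(prompt, guess[:i]) for i in range(len(guess))]
        passes += 1
        if new == guess:  # fixed point reached; this pass only confirms the guess
            return guess, passes
        guess = new


prompt = [2, 5]
n = 6
ar_out, ar_passes = ar_decode(prompt, toy_next_token, n)

# Cold start from an uninformative guess versus a warm start sharing a correct prefix.
cold_out, cold_passes = jacobi_decode(prompt, toy_next_token, [0] * n)
warm_out, warm_passes = jacobi_decode(prompt, toy_next_token, ar_out[:3] + [0] * (n - 3))
```

Position i is guaranteed to be correct after i + 1 passes, so this sketch needs at most n + 1 passes, counting the final confirming pass. With an unstructured toy model the cold start reaches that worst case. Any speedup in practice depends on how quickly the model's guesses converge, which this toy example does not measure.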