PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction
Shufan Li, Aditya Grover
TL;DR
PredGen introduces input-time speculation to reduce latency in real-time, LLM-driven voice chat while preserving the modular cascade architecture. By combining iterative verification and predictive generation (autoregressive decoding, or Jacobi decoding with CLLM) with preemptive TTS, PredGen shortens the time to the start of audio output, achieving around a 2x average latency reduction across diverse benchmarks with a controllable quality trade-off. The approach is demonstrated on-device with multiple base models and generalizes to several 7B+ LLMs, which makes it a practical fit for privacy-conscious, single-user deployments. Overall, PredGen makes low-latency cascaded voice systems more feasible by using otherwise idle input-time compute and by offering configurable strategies that balance latency against sample quality.
Abstract
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems that generate the audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, which degrades the user experience. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We find that this latency is dominated by the time the LLM takes to generate the first sentence, which the TTS system requires as input because it synthesizes audio responses sentence by sentence. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
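The idea of speculating at input time can be illustrated with a minimal toy sketch in Python. It is not the authors' implementation: `toy_next_token`, `verify_and_extend`, and `simulate` are hypothetical names, the "model" is a hard-coded lookup standing in for an LLM's greedy next-token choice, and latency is approximated by the number of sequential decode steps needed after the user stops speaking (in a real system the draft would be checked in a single parallel forward pass, which this sketch does not count).

```python
from typing import Callable, List, Tuple

SENTENCE_END = {".", "?", "!"}
NextToken = Callable[[List[str], List[str]], str]


def toy_next_token(prompt: List[str], generated: List[str]) -> str:
    """Hypothetical stand-in for an LLM's greedy next-token choice."""
    if "weather" in prompt:
        reply = ["It", "is", "sunny", "."]
    else:
        reply = ["Could", "you", "clarify", "?"]
    return reply[len(generated)] if len(generated) < len(reply) else "<eos>"


def verify_and_extend(
    prompt: List[str], draft: List[str], next_token: NextToken
) -> Tuple[List[str], int]:
    """Keep the longest draft prefix the model agrees with, then decode
    until the first sentence ends. Returns (sentence, decode_steps), where
    decode_steps counts only tokens generated beyond the accepted prefix."""
    accepted: List[str] = []
    for tok in draft:
        if next_token(prompt, accepted) != tok:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)

    steps = 0
    while not accepted or accepted[-1] not in SENTENCE_END:
        tok = next_token(prompt, accepted)
        if tok == "<eos>":
            break
        accepted.append(tok)
        steps += 1
    return accepted, steps


def simulate(
    partials: List[List[str]], next_token: NextToken
) -> Tuple[List[str], int]:
    """Speculate on each partial transcript while the user is still
    speaking, then verify the latest draft against the final input."""
    draft: List[str] = []
    for partial in partials[:-1]:  # input time: compute is otherwise idle
        draft, _ = verify_and_extend(partial, draft, next_token)
    # Only the steps taken in this last call are visible to the user.
    return verify_and_extend(partials[-1], draft, next_token)


partials = [
    ["what"],
    ["what", "is", "the", "weather"],
    ["what", "is", "the", "weather", "today"],
]
sentence, steps_with_speculation = simulate(partials, toy_next_token)
_, steps_baseline = verify_and_extend(partials[-1], [], toy_next_token)
print(sentence, steps_with_speculation, steps_baseline)
```

In this toy run the draft produced from the second partial transcript survives verification against the final input, so the first sentence can be handed to TTS without further decoding, whereas the baseline must decode it token by token after the user finishes. When the final input changes the response, the accepted prefix shrinks and the saving shrinks with it.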
