Plato: Plan to Efficiently Decode for Large Language Model Inference
Shuowei Jin, Xueshen Liu, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Atul Prakash, Matthew Lentz, Danyang Zhuo, Feng Qian, Z. Morley Mao
TL;DR
Plato improves LLM inference efficiency through semantic-aware parallel decoding built on an algorithm-system co-design. A dependency-aware planning stage organizes sub-problems into a directed acyclic graph (DAG), so independent nodes can be decoded in parallel while answer coherence is preserved; a pipelined workflow and a global KV context cache reduce overhead further. On Vicuna and WizardLM, Plato improves throughput by up to 68% over autoregressive decoding while achieving a 40% net win rate in answer quality, and it reaches a 90% quality net win rate against Skeleton-of-Thought (SoT). Ablation studies attribute a 29% speedup improvement to pipelining and a 75% reduction in prefill overhead to KV-cache reuse.
Abstract
Large language models (LLMs) have achieved remarkable success in natural language tasks, but their inference incurs substantial computational and memory overhead. To improve efficiency, parallel decoding methods like Skeleton-of-Thought (SoT) decompose prompts into sub-problems for concurrent processing. However, these methods significantly compromise answer quality by treating semantically linked sub-problems as independent. We propose Plato, a novel approach that co-designs algorithms and systems for semantic-aware parallel decoding. Plato leverages LLMs to organize sub-problems into a dependency graph based on logical and causal relationships, enabling concurrent decoding of non-dependent nodes while preserving answer coherence and quality. To further enhance efficiency, Plato pipelines planning and node decoding stages, implements a global context cache, and carefully structures node inference prompts to maximize key-value cache reuse and minimize overhead. Our evaluations show that Plato improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality. Compared to SoT, Plato demonstrates a remarkable 90% quality net-win rate. Ablation studies reveal that our pipeline design improves speedup by 29%, while our KV cache reuse optimization reduces overhead by 75%.
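The dependency-graph idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`plan_waves`, `decode_node`, `run_dag`), the example graph, and the use of threads are all assumptions made for illustration, and `decode_node` is a hypothetical stand-in for an LLM call. The sketch groups nodes into waves so that each node runs only after all of its parents have finished, and nodes within a wave run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_waves(deps):
    """Group nodes into waves; each node depends only on nodes in earlier waves."""
    remaining = {node: set(parents) for node, parents in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A node is ready once every one of its parents has been decoded.
        ready = sorted(n for n, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown parent in graph")
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return waves


def decode_node(node, context):
    """Hypothetical stand-in for an LLM call that decodes one sub-problem.

    `context` maps each parent node to its already-decoded output.
    """
    return f"{node}({','.join(sorted(context))})"


def run_dag(deps, max_workers=4):
    """Decode every node, running the nodes of each wave concurrently."""
    outputs = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for wave in plan_waves(deps):
            futures = {
                n: pool.submit(decode_node, n, {p: outputs[p] for p in deps[n]})
                for n in wave
            }
            for n, fut in futures.items():
                outputs[n] = fut.result()
    return outputs


# Assumed example graph: two middle sub-problems depend only on the intro.
deps = {
    "intro": [],
    "causes": ["intro"],
    "effects": ["intro"],
    "summary": ["causes", "effects"],
}
```

Calling `run_dag(deps)` decodes `causes` and `effects` concurrently once `intro` is available, then decodes `summary` with both outputs as context. The wave barrier is a simplification chosen for brevity: the abstract describes Plato as pipelining planning with node decoding and reusing the KV cache across nodes, neither of which this sketch models.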
