Plato: Plan to Efficiently Decode for Large Language Model Inference

Shuowei Jin, Xueshen Liu, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Atul Prakash, Matthew Lentz, Danyang Zhuo, Feng Qian, Z. Morley Mao

TL;DR

Plato improves LLM inference efficiency through semantic-aware parallel decoding built on an algorithm-system co-design. A dependency-aware planning stage organizes sub-problems into a directed acyclic graph (DAG), so independent nodes are decoded in parallel while coherence is preserved; a pipelined workflow and a global KV context cache reduce overhead further. On Vicuna and WizardLM, Plato improves throughput by up to 68% over autoregressive decoding while achieving a 40% net win rate in answer quality, and it reaches a 90% quality net-win rate against Skeleton-of-Thought (SoT). Ablation studies attribute a 29% speedup improvement to the pipeline design and a 75% reduction in prefill overhead to KV-cache reuse.

Abstract

Large language models (LLMs) have achieved remarkable success in natural language tasks, but their inference incurs substantial computational and memory overhead. To improve efficiency, parallel decoding methods like Skeleton-of-Thought (SoT) decompose prompts into sub-problems for concurrent processing. However, these methods significantly compromise answer quality by treating semantically linked sub-problems as independent. We propose Plato, a novel approach that co-designs algorithms and systems for semantic-aware parallel decoding. Plato leverages LLMs to organize sub-problems into a dependency graph based on logical and causal relationships, enabling concurrent decoding of non-dependent nodes while preserving answer coherence and quality. To further enhance efficiency, Plato pipelines planning and node decoding stages, implements a global context cache, and carefully structures node inference prompts to maximize key-value cache reuse and minimize overhead. Our evaluations show that Plato improves throughput by 68% over autoregressive decoding while achieving a 40% net win rate in answer quality. Compared to SoT, Plato demonstrates a remarkable 90% quality net-win rate. Ablation studies reveal that our pipeline design improves speedup by 29%, while our KV cache reuse optimization reduces overhead by 75%.
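
The abstract reports quality as a "net win rate" without defining it here. A minimal sketch of one common reading, assumed rather than taken from the paper, is the fraction of pairwise comparisons won minus the fraction lost. The function name and the counts below are hypothetical and are not the paper's data.

```python
def net_win_rate(wins: int, ties: int, losses: int) -> float:
    """Net win rate under the assumed definition: (wins - losses) / total comparisons."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (wins - losses) / total


# Hypothetical counts for illustration only (not results from the paper).
rate = net_win_rate(wins=50, ties=20, losses=10)
print(f"net win rate: {rate:.0%}")
```

Under this reading, a positive value means the method wins more comparisons than it loses, and ties dilute the score without changing its sign.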

Paper Structure

This paper contains 44 sections, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: An example to demonstrate the difference between SoT and Plato.
  • Figure 2: Plato begins with a planning phase where the LLM decomposes the original question into nodes (sub-problems) with their logical dependencies. As each node is generated, it enters a waiting queue. Nodes become eligible for inference when all their dependencies have been satisfied. For example, in the figure, Node 1 and Node 2 have no dependencies, so they are immediately launched for inference upon generation. Node 3, however, depends on both Node 1 and Node 2, so it must remain in the waiting queue until both dependency nodes complete their inference. This dependency-aware scheduling ensures generation quality while maximizing parallel execution opportunities.
  • Figure 3: [Overall Evaluation]: Answer quality and speed-up compared to normal autoregressive generation (AR) on Vicuna.
  • Figure 4: [Overall Evaluation]: Quality of answers across all models between different methods on Vicuna.
  • Figure 5: Plan generation prompt of Plato.
  • ...and 1 more figure
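
The dependency-aware scheduling described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not Plato's implementation: `decode_node` is a hypothetical stand-in for an LLM decoding call, `run_plan` and the `plan` dictionary format are invented for the example, and each node is assumed to depend only on nodes the planner emitted earlier.

```python
import asyncio


async def decode_node(node_id, dep_answers, events):
    """Hypothetical stand-in for one LLM decoding call for a single node."""
    events.append(("start", node_id))
    await asyncio.sleep(0)  # yield control, as a real asynchronous model call would
    events.append(("done", node_id))
    return f"answer[{node_id}] given {len(dep_answers)} dependency answer(s)"


async def run_plan(plan):
    """Decode every node, launching each one once its dependencies have finished.

    `plan` maps node id -> list of dependency ids, in planner emission order
    (assumption: a node only depends on nodes emitted before it).
    """
    events = []
    tasks = {}

    async def run_when_ready(node_id, deps):
        # The node waits here until every dependency has produced an answer.
        dep_answers = [await tasks[d] for d in deps]
        return await decode_node(node_id, dep_answers, events)

    # Nodes without dependencies start decoding right away and run concurrently.
    for node_id, deps in plan.items():
        tasks[node_id] = asyncio.ensure_future(run_when_ready(node_id, deps))

    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results)), events


# The example from the caption: Node 3 depends on Node 1 and Node 2.
plan = {1: [], 2: [], 3: [1, 2]}
answers, events = asyncio.run(run_plan(plan))
print(events)
```

In this sketch Node 1 and Node 2 are launched as soon as they appear in the plan, while Node 3 does not start until both have completed, mirroring the waiting-queue behavior in the caption. A real system would also overlap planning with decoding and reuse the KV cache, which this sketch does not model.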