Collaborative Speculative Inference for Efficient LLM Inference Serving

Luyao Gao, Jianchun Liu, Hongli Xu, Xichong Zhang, Yunming Liao, Liusheng Huang

TL;DR

This paper tackles the inefficiencies of speculative inference for large language models by decoupling speculative decoding from verification and enabling multi-node collaboration through adaptive routing and confidence-aware token fusion. The CoSine system uses a star-topology speculation cluster and a verification server, coordinated by a collaborative pipeline that employs a linear-programming-based batch scheduler to balance draft generation and verification in real time. Key contributions include adaptive routing to domain-specialized drafters, confidence-based fusion across multiple drafts, and dynamic pipeline orchestration, leading to substantial reductions in latency and increases in throughput under equivalent resource costs. The approach demonstrates strong performance gains across offline and online serving while offering robust operation under varying workloads and task domains, implying practical impact for scalable, cost-efficient LLM inference serving.

Abstract

Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
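
To make the draft-then-verify idea concrete, the following is a minimal sketch of confidence-based fusion across several drafters, followed by greedy verification against a target model. It is an illustration under stated assumptions, not CoSine's actual algorithm: the abstract does not specify the fusion rule, so summing per-token confidences is a stand-in, and the `Drafter` and `Target` callables, the toy sentence, and all function names are hypothetical. A real system would verify all draft positions in one parallel forward pass; the loop in `verify` only emulates that outcome.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a drafter maps a token prefix to (next_token, confidence in [0, 1]);
# the target maps a token prefix to its own greedy next token.
Drafter = Callable[[Sequence[str]], Tuple[str, float]]
Target = Callable[[Sequence[str]], str]


def fuse(proposals: Sequence[Tuple[str, float]]) -> str:
    """Return the token with the largest summed confidence (assumed rule)."""
    scores: Dict[str, float] = {}
    for token, confidence in proposals:
        scores[token] = scores.get(token, 0.0) + confidence
    return max(scores, key=lambda t: scores[t])


def draft(prefix: Sequence[str], drafters: Sequence[Drafter], k: int) -> List[str]:
    """Generate k draft tokens; each step fuses every drafter's proposal."""
    tokens: List[str] = []
    for _ in range(k):
        context = list(prefix) + tokens
        tokens.append(fuse([d(context) for d in drafters]))
    return tokens


def verify(prefix: Sequence[str], draft_tokens: Sequence[str], target: Target) -> List[str]:
    """Accept the longest draft prefix the target agrees with, plus one target token."""
    accepted: List[str] = []
    for token in draft_tokens:
        expected = target(list(prefix) + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            return accepted
        accepted.append(token)
    accepted.append(target(list(prefix) + accepted))  # extra token when all match
    return accepted


# Toy models: the target "knows" a fixed sentence; each drafter is
# confident on a different part of it (standing in for domain expertise).
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]


def toy_target(prefix: Sequence[str]) -> str:
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "<eos>"


def drafter_a(prefix: Sequence[str]) -> Tuple[str, float]:
    guess = ["the", "cat", "sat", "in", "a", "hat"]
    i = min(len(prefix), len(guess) - 1)
    return guess[i], (0.9 if i < 3 else 0.3)


def drafter_b(prefix: Sequence[str]) -> Tuple[str, float]:
    guess = ["a", "dog", "sat", "on", "the", "mat"]
    i = min(len(prefix), len(guess) - 1)
    return guess[i], (0.4 if i < 3 else 0.8)


if __name__ == "__main__":
    fused = draft([], [drafter_a, drafter_b], k=4)
    print(fused)
    print(verify([], fused, toy_target))
```

In this toy setup neither drafter alone matches the target on all four positions, but the fused draft does, which is the intuition behind combining drafters with complementary strengths to raise draft acceptance.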

Paper Structure

This paper contains 18 sections, 8 equations, 10 figures, 3 tables, 2 algorithms.

Figures (10)

  • Figure 1: Overview of speculative inference, showing the verification and speculative decoding phases. The figure also presents LLM ensembles at the pre-inference and during-inference stages.
  • Figure 2: Performance bounds across model architectures and drafting configurations.
  • Figure 3: Model capabilities and token confidence in draft generation.
  • Figure 4: Overview of the CoSine architecture and workflow.
  • Figure 5: The token fusion process of draft generation in the speculation cluster.
  • ...and 5 more figures