Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO
Haim Barad, Ekaterina Aidova, Yury Gorbachev
TL;DR
This work addresses the latency and efficiency challenges of generative AI by combining speculative sampling with KV caching in a quantized OpenVINO workflow. It analyzes how to co-optimize model-based strategies (quantization) and execution-based strategies (KV caching, speculative sampling), including how to manage separate caches for the draft and target models. The authors provide a practical notebook and experimental evidence (notably with DollyV2) showing significant latency reductions when the draft model is sufficiently smaller than the target model (e.g., a size ratio of at least 10x), while preserving the target distribution. The approach offers a concrete path to faster, more energy-efficient inference for deployments that need high throughput and controlled resource usage. Overall, the paper contributes methodological guidance, implementation artifacts, and empirical insights for deploying efficient generative pipelines with OpenVINO and HuggingFace Optimum.
Abstract
Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling, which reduces the overall latency of text generation, and compare it with standard autoregressive sampling. Speculative sampling can be combined with model-based optimizations (e.g., quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.
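
To make the contrast with autoregressive sampling concrete, the accept/reject rule at the core of speculative sampling can be sketched in a few lines of Python. This is a minimal toy illustration under stated assumptions, not the code from the accompanying notebook: the names `sample` and `speculative_step` are hypothetical, the per-position distributions are passed in as fixed lists rather than computed by real draft and target models, and no KV cache is modeled.

```python
import random


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(draft_dists, target_dists, rng):
    """Run one speculative sampling step over K drafted positions (toy version).

    draft_dists:  K distributions from the (hypothetical) draft model.
    target_dists: K + 1 distributions from the (hypothetical) target model;
                  the extra one is used only if all K drafts are accepted.
    Returns the emitted tokens (between 1 and K + 1 of them).
    """
    emitted = []
    for q, p in zip(draft_dists, target_dists):
        x = sample(q, rng)  # the draft model proposes a token
        # Accept the proposal with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[x] / q[x]):
            emitted.append(x)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft was accepted: take one bonus token from the target model.
    emitted.append(sample(target_dists[-1], rng))
    return emitted
```

When the draft and target distributions agree, every proposal is accepted and the step emits K + 1 tokens for a single target-model evaluation, which is where the latency savings come from. When they disagree, the residual resampling keeps the emitted tokens distributed according to the target model.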
