Accelerating LLM Inference with Precomputed Query Storage

Jay H. Park, Youngju Cho, Choungsol Lee, Moonwook Oh, Euiseong Seo

TL;DR

This work addresses the latency and compute costs of LLM inference in resource-constrained settings by proposing StorInfer, a storage-assisted system that precomputes and stores query–response pairs offline. An offline Generator builds diverse, deduplicated queries using adaptive masking and adaptive sampling, storing results in a disk-backed vector index for fast similarity retrieval. Online, StorInfer retrieves precomputed matches to bypass GPU inference and falls back to live inference only when no suitable match exists, achieving up to 17.3% end-to-end latency reduction without sacrificing quality on QA benchmarks. The results demonstrate that storage-based precomputation can scale with storage capacity to materially reduce latency and compute needs, making low-latency LLM deployment feasible on edge devices.

Abstract

Large language model (LLM) inference often suffers from high latency, particularly in resource-constrained environments such as on-device or edge deployments. To address this challenge, we present StorInfer, a novel storage-assisted LLM inference system that accelerates response time by precomputing and storing predictable query-response pairs offline. When a user query semantically matches a precomputed query, StorInfer bypasses expensive GPU inference and instantly returns the stored response, significantly reducing latency and compute costs. To maximize coverage and effectiveness, StorInfer employs an LLM-driven generator that adaptively produces diverse and deduplicated queries based on a given knowledge base. This is achieved via two techniques: adaptive query masking, which prevents regeneration of similar queries, and adaptive sampling, which dynamically tunes generation parameters to promote semantic diversity. The resulting query-response pairs are embedded and indexed using a disk-backed vector database to enable fast, similarity-based retrieval at runtime. Using this approach, we generated 150K unique precomputed pairs (taking up to 830 MB of storage space), achieving up to 17.3% latency reduction with no loss in response quality. Our evaluation across multiple QA datasets demonstrates the practicality and scalability of storage-assisted inference, especially in scenarios with predictable query distributions. StorInfer highlights a promising direction in leveraging storage as a primary enabler for efficient, low-latency LLM deployment.
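The lookup-then-fallback flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the in-memory linear scan stands in for the disk-backed vector database, the two-dimensional embeddings are toy values, and the names `PrecomputedStore`, `answer`, and the `0.9` similarity threshold are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class PrecomputedStore:
    """Hypothetical stand-in for a vector index of precomputed query-response pairs."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Assumed similarity cutoff; the source does not state a value.
        self.threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def add(self, query_embedding: Vector, response: str) -> None:
        """Store one precomputed pair (done offline in the described system)."""
        self._entries.append((query_embedding, response))

    def lookup(self, query_embedding: Vector) -> Optional[str]:
        """Return the best-matching stored response, or None when below threshold."""
        best_score = -1.0
        best_response: Optional[str] = None
        for stored_embedding, response in self._entries:
            score = cosine_similarity(query_embedding, stored_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def answer(
    query_embedding: Vector,
    store: PrecomputedStore,
    run_llm: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (response, served_from_storage); call the LLM only on a miss."""
    cached = store.lookup(query_embedding)
    if cached is not None:
        return cached, True
    return run_llm(), False
```

In this sketch a hit skips `run_llm` entirely, which is where the latency and compute savings would come from. A looser threshold would raise the hit rate at the risk of returning responses to queries that are only superficially similar.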

Paper Structure

This paper contains 12 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: StorInfer system architecture.
  • Figure 2: StorInfer Runtime executes vector search and LLM inference in parallel, sending a termination signal upon a query hit.
  • Figure 3: Response latency of traditional LLM inference vs. vector search in StorInfer across different datasets.
  • Figure 4: Hit rate and storage usage with increasing number of precomputed queries on the SQuAD dataset.
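The parallel execution named in the Figure 2 caption (vector search and LLM inference started together, with a termination signal on a query hit) might look roughly like the following sketch. It assumes a thread-based design with a shared `threading.Event` as the termination signal; `serve`, `fake_generate`, and the token-by-token generator interface are hypothetical names, not taken from the paper.

```python
import threading
import time
from typing import Callable, Iterator, List, Optional, Tuple


def fake_generate(query: str) -> Iterator[str]:
    """Hypothetical stand-in for token-by-token LLM decoding."""
    for token in ["a", "b", "c"]:
        time.sleep(0.01)  # simulate per-token decoding latency
        yield token


def serve(
    query: str,
    search: Callable[[str], Optional[str]],
    generate_tokens: Callable[[str], Iterator[str]],
) -> Tuple[str, str]:
    """Run search and generation concurrently; stop generation on a hit."""
    stop = threading.Event()  # termination signal sent on a query hit
    tokens: List[str] = []

    def llm_worker() -> None:
        for token in generate_tokens(query):
            if stop.is_set():
                return  # abandon decoding once the search has hit
            tokens.append(token)

    worker = threading.Thread(target=llm_worker)
    worker.start()

    hit = search(query)  # runs in the caller's thread while the LLM decodes
    if hit is not None:
        stop.set()
    worker.join()

    if hit is not None:
        return hit, "storage"
    return "".join(tokens), "llm"
```

Starting generation before the search result is known means a miss costs no extra latency over plain inference, while a hit lets the system discard the partially generated answer. Here the worker checks the signal only between tokens, so termination is cooperative rather than immediate.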